H3C AMD G6 Servers RAS Technology White Paper-6W101

HomeSupportTechnology LiteratureTechnology White PapersH3C AMD G6 Servers RAS Technology White Paper-6W101
H3C AMD G6 Servers RAS Technology White Paper-6W101
Table of Contents
Related Documents
book
Title Size Download
book 1.14 MB

 

 

H3C AMD G6 Servers RAS

Technology White Paper

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Copyright © 2024 New H3C Technologies Co., Ltd. All rights reserved.

No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of New H3C Technologies Co., Ltd.

Except for the trademarks of New H3C Technologies Co., Ltd., any trademarks that may be mentioned in this document are the property of their respective owners.

The information in this document is subject to change without notice.


Contents

Overview·· 1

Benefits· 1

Applicable products· 2

Using this document 2

RAS system architecture· 3

RAS operating mechanism·· 4

Error detection· 5

Error reporting· 6

Error processing· 7

Memory CE funnel and threshold mechanism·· 8

Memory CE funnel mechanism·· 8

Memory CE threshold· 11

SMI interrupt control 13

Memory CE report method· 13

Reliability· 14

Memory RAS· 14

On-chip ECC/Parity· 14

DRAM ECC· 14

DRAM Error Check and Scrub (ECS) 14

DRAM Writeback Suppression on X4 Writes· 15

DRAM UECC Retry· 15

DRAM Address/Command Parity with Replay· 15

DRAM Write Data CRC with Replay· 15

DRAM Read Data CRC with Replay· 16

DRAM Patrol Scrubber 16

DRAM Redirect Scrubber 16

GMI/xGMI RAS· 17

On-Package Link Errors· 17

Off-Package Link Errors· 17

PCIe RAS· 17

PCIe® End-to-End CRC· 17

Availability· 18

Data Poisoning· 18

Fatal Error Recovery· 18

Machine Overflow Recovery· 19

CPU Watchdog Timers· 19

Decoding Last I/O Addresses Table· 19

DRAM Thermal Throttling· 20

CPU Thermal Throttling· 20

PCIe speed reduction and bandwidth reduction detection· 20

PCIe Downstream Port Containment and Error Disconnect Recovery· 21

PCIe Orderly and Surprise Hotplug· 21

DRAM Memory Tester 21

Automatic Boot-time DIMM MapOut 22

Automatic Boot-time Core Disable· 22

Serviceability· 23

In-band Error Reporting· 23

Machine Check Architecture· 23

Platform First Error Handling· 24

PCIe® Advanced Error Reporting (AER) 24

CXL™ Error Handling· 24

CXL™ Protocol Error Reporting· 25

CXL™ Component Error Reporting· 25

PMIC Error Handling· 25

Machine Check Alert 26

Fatal Error Alert 26

Thermal Throttle Alert 26

MCA Mailbox· 26

Legacy SB-RMI MCA Interface· 27

Reset Reason Mailbox· 27

ACPI Platform Error Interface (APEI) 27

APEI Boot Error Record Table (BERT) 27

APEI Hardware Error Status Table (HEST) 28

APEI Error Injection Table (EINJ) 28

Out of Band Error Monitoring· 28

DRAM Corrected Error Leaky Bucket Counters· 29

Boot Status Indicators· 29

DRAM Post-Package Repair (CPU) 29

MCA Address Translation· 30

MCA FruText and DIMM FRU Identification· 30

Platform RAS features· 31

ABL RAS· 31

Error Injection· 31

IPMI command classification for RAS reporting· 31

SMI storm suppression· 33

SWITCH RAS· 33

 


Overview

The server is one of the key components of any modern data center infrastructure. It includes various components such as processors, storage devices, PCIe devices, power supplies, and fans. To ensure service continuity, correct server operation based on data integrity are critical to a modern data center. In other words, we must avoid data corruption no matter data is stored in any server component (memory, cache, or processor registers) or transmitted through any platform links (xGMI or PCI Express).

When a server component fails, the set of reliability, availability and serviceability (RAS) features can meet the above requirements by maximizing service availability and maintaining data integrity.

Table 1 RAS definition based on H3C G6 servers

Item

Definition

Reliability

Probability that the system produces the correct output within a given time T, as measured by the mean time between failures (MTBF) metric.

It can be enhanced by avoiding, detecting, and repairing hardware failures. A reliable system does not provide incorrect data and calculation results, but can detect and correct data corruption.

Availability

Probability of correct system running at a given time, and the percentage of the actual running time of the server to the actual running time.

Serviceability

How easy and fast the system can be repaired or maintained.

If the time for the system to repair errors increases, the serviceability decreases.

Serviceability can be improved by simplifying system issue diagnosis and providing clear and intelligent advance warnings of failures to avoid system failures.

 

Classify errors from high to low error severity into uncorrectable errors, delayed errors, correctable errors and transparent errors.

Table 2 Error categories

Item

Definition

Uncorrectable errors

Errors that cannot be corrected by hardware are reported to software through Machine Check Exception (MCE).

Delayed errors

Errors that cannot be corrected by hardware but does not result in immediate interruption of program progress, loss of data integrity, or corruption of processor state. Delayed errors indicate that the data is poisoned but not used, such as poisoned data.

Correctable errors

Correctable errors can be fixed by hardware and do not lead to data loss or processor status damage.

Transparent errors

Errors that has been corrected by the hardware, does not result in data loss or processor state corruption, and the hardware is capable of correcting the next error in the same device. For example, a single-bit error occurs in a cache with double-bit error correction capability.

 

Benefits

RAS can provide the following benefits:

·     Increased system uptime—Increases system reliability for the system to stay longer, as measured by the Mean Time Between Failures (MTBF), Annual Crash Rate (ACR), or Annual Service Rate (ASR) metric.

·     Reduced the duration of unexpected downtimeAdopts AMD EPYC processors to support synchronized hardware and firmware logging, helping users to identify and isolate errors and take preventive or proactive maintenance measures. This enables quick system restoration, reduces the cost of repairs, and mitigates the consequences of the outage to the business.

Outages are inevitable even with the best plans and processes. When an unplanned outage happens, a maintainable system can come back online quickly, as measure by the Mean Repair Time (MTTR) matric.

·     Enhanced data integrityRAS provides several mechanisms to prevent data corruption or correct poisoned data, which ensures data corruption can get contained once detected.

Applicable products

This document is applicable to the following H3C UniServer servers:

·     H3C UniServer R3950 G6

·     H3C UniServer R4950 G6

·     H3C UniServer R5350 G6

·     H3C UniServer R5500 G6 AMD

Using this document

The information in this document is subject to change over time.

The information in this document might differ from your product if it contains custom configuration options or features.


RAS system architecture

Based on AMD RAS architecture, H3C provides a complete fault management system together with hardware, the BIOS, HDM, and OS error processing mechanisms. The system can provide functions such as error diagnosis, error location, error correction, information collection, and error reporting. Since the core of the system runs on the BIOS and HDM, it does not rely on the OS and can perform all-time detection of the system and take corresponding measures once an error occurs.

As shown in Figure 1, the fault management system contains the hardware layer, CPLD, processor platform, HDM (out-of-band management), the BIOS, and OS.

·     HDMCore of the error location system. It is responsible for error information collection and analysis and can display error information as event logs or alarms from the Web interface.

·     Processor platformSupported by AMD EPYC processors, which provides more powerful management of errors occurred on processors, memory modules, and PCIe devices.

·     CPLDConnects downlink hardware modules, including power supplies, fans, and other underlying hardware (except processors, memory modules, drives, and standard PCIe modules), captures hardware exceptions, connects to HDM at the uplink, and transmits error information.

·     BIOSCollects and locates errors occurred on processors, memory modules, PCIe devices, and storage devices, provides error location results to HDM, and provides OS-level error management interfaces, such as APEI, to the OS.

·     Web interfaceWeb interface provided by management tools, such as HDM, for users to maintain the server locally or remotely. Users can use the Web interface together with LEDs of specific server components to manage the server.

·     Involved protocolsProtocols used by the fault management system includes eSPI, SPI, PCIe, UART, I2C, SMBUS, and LocalBus.

Figure 1 H3C fault management system architecture

 


RAS operating mechanism

The basic error processing schemes of RAS are as follows:

·     For transparent errors, no further action is required at the upper layers, as the hardware has already corrected the errors.

·     For correctable errors, RAS marks the error location and fast repair the corresponding module. Users will not aware the occurrence of such errors.

·     For delayed errors, the location of toxic data will be immediately marked, and by limiting the spread of poisoned data, delayed errors will be prevented from evolving into more serious problems.

·     For uncorrectable errors, RAS isolates the errors by isolating bad memory blocks or degrading the bus to maintain system operation. If severe errors occur and result in system outage, you must use HDM to restore or restart the system.

RAS technology for AMD G6 servers is realized based on the following mechanisms:

·     Maintain reliable operation through robust components and maximum error detection.

·     Avoid system failures by reducing global errors, error correction, and continuing operation despite uncorrectable errors.

·     Avoid system downtime (mainly for planned downtime) through the following methods:

¡     First-time error diagnosis: Try to capture enough data to complete the error diagnosis when the mistake occurs for the first time.

¡     Internal redundancy design and online maintenance capabilities: Allows for continuous operation without replacing parts or problem-solving without shutting down. For more information about redundancy design and online maintenance, see the detailed introduction.

·     Flexibility: The choice of whether to enable a function is primarily implemented in the firmware, not in the hardware.

The new and improved RAS features in the AMD Genoa/G6 are as follows:

·     Reliability improvement:

¡     Support Advanced Memory Device Correction (AMDC).

¡     Support DRAM Error Check and Scrub (ECS).

¡     Support DRAM read and write data CRC.

¡     Support PCIe/CXL LCRC + replay.

¡     Support PCIe/CXL.io ECRC.

¡     Support PCIe/CXL.io uncorrected error detection.

·     Availability improvement:

¡     Support on-chip watchdog timers.

¡     Support PCIe system firmware intermediary.

·     Maintainability improvement (diagnosis and repair):

¡     Support in-band error reporting.

¡     Support memory error reporting.

¡     Support out-of-band error monitoring.

¡     Support DRAM runtime post-package repair.

¡     Support error injection.

¡     Support error injection on secure silicon.

·     Added features:

¡     DRAM Corrected Error Leaky Bucket Counters.

¡     Advanced Platform Management Link interface (APML).

Figure 2 RAS schemes

 

The following introduces the key points of AMD RAS technology according to the processes of error detection, reporting, and processing.

Error detection

Error detection of AMD G6 series servers is mainly achieved through ECC, Parity and CRC technologies in key components and areas. Errors are mainly divided into data errors, bus errors and logical function errors. The corresponding common methods of error detection are as follows:

·     The general data module uses ECC and Parity for detection.

·     The bus interface module uses the CRC for detection.

·     The logic functional module uses the Timeout for detection.

Error correction is generally achieved through ECC and Retry. The main data and bus module's error detection and error correcting capability are shown in Table 3.

Table 3 Error detection methods for different modules

Module

Detection method

CPU Core

Parity

L1 Data Cache

ECC

L1 Data Tag

ECC

L1 Instruction TLB

Parity + Retry

L1 Instruction Cache

Parity + Retry

L1 Instruction Tag

Parity + Retry

L2/L3 Cache

DEC-TED ECC

L2/L3 Cache Tag

SEC-DED ECC

DRAM Address/Command

Parity + Replay

DRAM Write Data

CRC + Replay

Memory Controller

SEC-DED ECC

Memory Controller DF

Parity

NBIO, PCIE and NBIF

Parity + ECC

SATA

Parity

USB

ECC

FCH A-Link

Parity

On-Chip Data Bus

Parity

Off-Package Link Packet

Parity + Retry

On-package and off-package PHY Controller

ECC

System Probe Filter

ECC

System Management Network

Parity

SMN Off-Package Link Packet

CRC + Retry

SMN On-Package Link Packet

CRC + Retry

SMU

Parity + ECC

MP5

Parity + ECC

PSP

Parity + ECC

Parameter Block

ECC

 

The most error-prone part of the CPU is the data storage module in the CPU, that is, the cache at all Layers. The AMD G6 server series offers comprehensive cache protection mechanisms, including ECC for Layer 1 data cache, parity check for Layer 1 instruction cache, and double-error correction, triple error detection (DEC-TED) for Layer 2 and Layer 3 cache. Compared to traditional single-error correction and double error detection (SEC-DED) ECC, it adds an additional correction layer. EC-TED reduces the impact of multi-bit transient errors by correcting double-bit errors and also reduces the impact of single-bit hard faults. EC-TED is able to correct a second single-bit error in the affected part of the cache.

In addition to ECC, the memory also supports Patrol Scrubber, Redirect Scrubber and Poison Scrubber. Some memory locations might not be accessed for a long time. The Patrol Scrubber's periodic memory inspection function periodically traverses all memory zones, correcting correctable mistakes. It's a good, low-cost method for checking memory and enhancing data integrity. Redirect Scrubber refers to the operation of correcting the error and writing the correct data back to the memory if a correctable error is found when the CPU instruction actively reads the memory data. The AMD G6 series server also supports Poison Scrubber, which marks the location of poison data to prevent reporting unnecessary delay errors.

Error reporting

The system reports detected errors and generates error log messages. The AMD G6 series server supports four types of error reporting methods: MCA, MCAX, AER, and AMPL.

MCA records processor and system hardware errors for reporting to the operating system. When an error is detected, MCA will save information about the error in specific registers and interrupt error reporting of the specified type. For hardware correctable errors, the software logs error information for subsequent diagnosis and repair. If the error cannot be corrected, the error is classified as an uncorrectable error. When an uncorrectable error occurs, the system immediately takes steps to correct the error and resume the interrupted program. If the error cannot be corrected by software, MCA determines the extent of the impact of the uncorrected error on the execution instruction flow and the architectural state of the processor or system, and then takes action to control the impact of the failure by terminating the damaged software process.

For errors that are not corrected and have no direct impact on the system, processor kernel, or the schema state of any currently executing thread, the hardware (HW) may classify them as delayed errors. Information about latency errors will be logged but not reported via MCE. Instead, the hardware continuously monitors for errors and upgrades them to uncorrectable errors when an error condition is about to affect the execution of the instruction flow or cause corruption of the processor core or system architectural state. At this point, if the reporting for this mistake source is enabled, it will trigger a disrupt in MCE to report the error. If the error can be corrected at the system level, it may be possible to restore the affected programs. If the error cannot be corrected, the operating system can terminate the affected program without shutting down the entire system.

The processor also supports Machine Check Architecture Extensions (MCAX), which is AMD's x86-64 expansion of MCA. It provides a richer MCA bank than traditional MCA and provides bank ownership all the way to a single kernel. Since each bank is owned by a single kernel, the MCA register can only be accessed from the MCA main thread, which helps ensure that interrupts from the machine checking the bank are routed to the appropriate thread. The MCA X architecture supports up to 256 MCA groups, each with 16 memory registers. The expanded MCA architecture holds more information about system health, which can be queried to speed troubleshooting.

Advanced Error Reporting (AER) is mainly used to report and handle PCIe errors. AER can locate the error source in the PCIe architecture, provides a standardized control mechanism for error messages received by PCIe Root through interrupt reporting, and can distinguish the severity of various uncorrectable errors.

Advanced Platform Management Link (APML) follows the SMBus protocol, also known as Sideband Interface (SBI). APML provides an interface for out-of-band access to MCA registers. You can use the APML interface to communicate with the server's out-of-band management system HDM so that error information can be directly obtained by HDM

Error processing

If a mistake is detected but cannot be corrected, the impact of the error is minimized through data poisoning marking or fault tolerance on the link.

Data poisoning is a mechanism that enables machine check recovery by transforming global uncorrected errors into local uncorrected errors, reducing the frequency of system interrupts. Data poisoning involves detecting, marking, and tracing uncorrectable data errors. As poisoned data moves through the system, the poisoned state is retained in memory, cache, and links so that the poisoned data is not used by the system. When marked by data poisoning, uncorrectable data errors can lead to fatal conditions for the system. Through data poisoning, uncorrectable data errors can be contained within the process context. When the system terminates the affected process, other processes and the system itself are not affected. The system remains available despite data errors. When the CPU attempts to use data marked as poison, poisoning consumption occurs. The poisoning consumption passes the Machine Check Anomaly (MCE) in the context of the consumption process or task, enabling the MCA to identify the process consuming data and proceed with restoration or termination. When the processor detects poisoned data, it logs a delayed error in the MCA bank to identify the location of the uncorrectable data error, helping determine if hardware failure has occurred and reconfigure the system to prevent the use of faulty hardware.

In terms of memory, AMD G6 series servers support the JEDEC-defined Post Package Repair (PPR) feature, which supports configuring spare DRAM rows to replace faulty ones. The combination of hardware and firmware support from the processor enables both soft (reconfigurable) and hard (permanent) repair, ensuring that DIMMs with problematic DRAM rows can be reconfigured and maintain the same level of reliability as before the issue occurred.

In terms of PCIe, the PCIe bus is protected by AER and EDPC, which can help isolate and recover from errors on the PCIe link. When an uncorrectable error occurs on a PCIe root port, the EDPC feature can recover by disconnecting and reconnecting the affected PCIe link. To ensure uninterrupted operation of devices under the OS, the PCIe root port saves the context of the PCIe endpoint device before disconnecting the link. After the link is restored, the context information of the PCIe endpoint is accurately recovered to ensure the system operates unaffected.

Memory CE funnel and threshold mechanism

Memory CE funnel mechanism

The memory CE funnel mechanism is designed to manage and control Correctable Error (CE) events on the AMD G6 platform. By reducing the value by one within a set time and discarding the value, memory CE funnel controls the frequency of overall error event reporting. The AMD Genoa platform provides the following features:

·     Funnel enablement register: Used to enable or disable the funnel feature. The disabling state is equivalent to the faucet being closed in Figure 3.

·     Speed register: Used to set the flow rate, which is equivalent to the degree to which the faucet is opened or closed as shown in Figure 3.

Figure 3 Memory CE funnel mechanism

 

Selecting the LeakMode or NoLeakMode mode

Set the DRAM Corrected Error Counter Enable value from the BIOS Setup Utility. Options include:

·     LeakModeEnable the funnel feature.

·     NoLeakModeDisable the funnel feature.

Figure 4 Selecting the LeakMode or NoLeakMode mode

 

Setting the Leak Rate

Set the DRAM Corrected Error Counter Leak Rate value from the BIOS Setup Utility.

Figure 5 Setting the Leak Rate

 

Figure 6 shows available values of the leak rate. The default value is 07h, which represents leak one memory CE every 10.24 us.

Based on experience, it is generally recommended to discard one CE per second. Therefore, you can select the closest available option, which is 18h, meaning one is discarded every 1.34 seconds.

Figure 6 Available values of the leak rate

 

Memory CE threshold

The memory CEs are accumulated by a counter. The counter starts at the Start Count value and increases by 1 every time a CE occurs. When the counter value reaches 0xFFFF, the counter overflows, which triggers an SMI interrupt. The BIOS starts to execute the interrupt.

Figure 7 Memory CE threshold

 

Setting the Start Count for the memory CE counter

Set the DRAM Corrected Error Counter Start Count value from the BIOS Setup Utility to specify the start value of the counter.

Figure 8 Setting the Start Count value for the memory CE counter

 

When the counter value reaches 0xFFFF, it triggers an SMI interrupt. The BIOS starts to execute the interrupt.

 

 

From this, it can be determined that the memory CE threshold is the value between Start Count and 0xFFFF. The threshold is calculated as follows:

For example, if the target threshold is 5000:

Start Count = 0xFFFF – 5000 = 0xEC77

Therefore, set the BIOS Setup option value to 0xEC77.

Setting the Memory CE Report Count Threshold

You can choose to report errors based on the frequency of persistent memory CEs. When the set threshold is met, SMI interrupts are masked to prevent memory CE reporting, enhancing system reliability and accuracy.

The default value of the Memory CE Report Count Threshold option on the BIOS Setup Utility is 1.

SMI interrupt control

Setting the MCA error threshold

You can set the threshold for correctable errors in the Machine Check Architecture (MCA). When the count reaches 0xFFF, it will trigger an SMI interrupt.

Configuring SMI storm suppression

To prevent performance impacts from excessive SMI interrupts, the AMD platform offers an SMI storm suppression mechanism. When Machine Check Exception (MCE) frequency exceeds 5 times per second, the system enters SMI Polling mode, handling MCE events at a rate of 5 times per second. Once the frequency drops below 5 times per second, the system reverts to SMI Interrupt mode.

Memory CE report method

Memory CE reporting to BMC

The BIOS reports memory CE information through the IPMI interface, which includes three formats: SEL log, SDS log, and OEM IPMI commands.

Memory CE reporting to OS

The BIOS reports error records to the ACPI Table using APEI-HEST-GHES. You can view these records in the OS using the dmesg command.


Reliability

Memory RAS

On-chip ECC/Parity

Feature name

On-chip ECC/Parity

Description

The processor protects critical on-chip SRAM, triggers, and latch arrays using ECC or parity check. The array can prevent multi-bit errors through interleaving.

Purpose

Improve system reliability and accuracy.

Configuration

Disabled by default and can be enabled from the BIOS.

Remarks

N/A

 

DRAM ECC

Feature name

DRAM ECC

Description

The memory controller of the processor supports two different error correction codes. Both DRAM error correction codes use symbol-based coding to form ECC. x4 ECC uses 36 4-bit symbols to generate a 144-bit ECC containing 128 data bits and 16 parity bits. x16 ECC uses 18 16-bit symbols to generate a 288-bit ECC containing 256 data bits and 32 parity bits.

When x4 DRAM devices are used on an 80-bit channel, x16 ECC with Advanced Memory Device Correction (AMDC) can correct all errors caused by a single failed DRAM device.

x16 ECC is compatible with the DDR5 limited-failure specification, providing better correction performance when using x8 DRAM devices on an 80-bit channel or x4 DRAM devices on a 72-bit channel.

Purpose

Improve system reliability and accuracy.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

DRAM Error Check and Scrub (ECS)

Feature name

DRAM Error Check and Scrub (ECS)

Description

The memory controller periodically corrects single-bit errors detected by ECC within the chip. This feature must be enabled in the memory controller. When a DRAM device exceeds a specific error threshold, the MCA of the memory controller records the error channel, device, bank, and row.

Purpose

Improve system reliability and accuracy.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

DRAM Writeback Suppression on X4 Writes

Feature name

DRAM Writeback Suppression on X4 Writes

Description

DRAM provides a method to suppress data write-back during a read-modify-write (RMW) operation when writing to X4.

Purpose

X4 device error correction write-back inhibition.

Configuration

Disabled by default and can be configured from the BIOS.

Remarks

N/A

 

DRAM UECC Retry

Feature name

DRAM UECC Retry

Description

If an uncorrected ECC error occurs, the processor can retry the command, providing recovery for transient errors on the data bus. If retrying doesn't result in an ECC error or corrects it, data is forwarded and the corrected error is recorded. If the maximum retry count is reached and the uncorrected ECC error persists, the data is poisoned, and the delayed error is recorded.

Purpose

Improve system reliability and accuracy.

Configuration

Disabled by default and can be configured from the BIOS.

Remarks

N/A

 

DRAM Address/Command Parity with Replay

Feature name

DRAM Address/Command Parity with Replay

Description

Based on the JEDEC DDR5 standard, the processor implements RCD parity check on the DDR5 address or command bus. If a parity check error occurs, the command can be replayed. This feature provides detection and recovery for transient errors on the bus.

Purpose

Improve system reliability and accuracy.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

DRAM Write Data CRC with Replay

Feature name

DRAM Write Data CRC with Replay

Description

According to the JEDEC DDR5 standard, the processor performs cyclic redundancy check (CRC) on the write data packets.

If a CRC error occurs, the command can be replayed. This function provides detection and recovery for transient errors on the bus.

Purpose

Improve system reliability and accuracy.

Configuration

Disabled by default and can be configured from the BIOS.

Remarks

N/A

 

DRAM Read Data CRC with Replay

Feature name

DRAM Read Data CRC with Replay

Description

According to the JEDEC DDR5 standard, the processor performs cyclic redundancy check (CRC) on the read data packets.

If a CRC error occurs, the command can be replayed. This function provides detection and recovery for transient errors on the bus.

Errors not corrected through replay will result in data poisoning and record delayed errors.

Purpose

Improve system reliability and accuracy.

Configuration

Disabled by default and can be configured from the BIOS.

Remarks

N/A

 

DRAM Patrol Scrubber

Feature name

DRAM Patrol Scrubber

Description

DRAM performs read-modify-write operations periodically on memory locations to detect and correct potential errors. This feature is awakened once per cycle and checks the next sequential position in memory.

The processor can configure this feature to simultaneously erase multiple address ranges, improving erase speed on systems with high memory occupancy.

Purpose

Improve system reliability and accuracy.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

DRAM Redirect Scrubber

Feature name

DRAM Patrol Scrubber

Description

When a read request encounters a correctable ECC error, the DRAM redirect handling feature will be invoked as necessary. This feature immediately writes the corrected data back to memory.

Purpose

Improve system reliability and accuracy.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

GMI/xGMI RAS

On-Package Link Errors

Feature name

On-Package Link Errors

Description

CRC errors might occur in the GMI/xGMI encapsulation. When correctable CRC errors exceed a threshold, they are recorded in MCA to indicate system performance degradation. Uncorrectable errors in the link will also be logged in MCA, resulting in a system fatal error event.

Purpose

Improve the reliability of GMI/xGMI link data transmission.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

Off-Package Link Errors

Feature name

Off-Package Link Errors

Description

The external GMI/xGMI link encapsulation might encounter CRC errors. When correctable CRC errors exceed a threshold, they are recorded in MCA to indicate system performance degradation. Uncorrectable errors in the link will also be logged in MCA, resulting in a system fatal error event.

Purpose

Improve the reliability of GMI/xGMI link data transmission.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

PCIe RAS

PCIe® End-to-End CRC

Feature name

PCIe® End-to-End CRC

Description

The processor supports generating PCIe ECRC and checking all PCIe root ports. For peer-to-peer requests passing through the processor, ECRC is not preserved. Typically, ECRC generation should only be enabled in the root ports and endpoints when there are switches between them.

Purpose

Achieve higher data integrity when exchanging data between two PCIe endpoints.

Configuration

Enabled by default.

Remarks

N/A

 


Availability

Data Poisoning

Feature name

Data Poisoning

Description

When a system encounters a poison, it will attempt to mark it. The marking process generates poison consumption, which in turn results in reporting to the MCA during the consumption process in the context.

Purpose

Improve system reliability.

Configuration

Automatically takes effect and cannot be disabled.

Remarks

N/A

 

Fatal Error Recovery

Feature name

Fatal Error Recovery

Description

When the CPU detects an unmarked, unrecoverable error, it acts to contain data corruption, issuing a system-level fatal error event to the internal Data Fabric module. This event freezes the Data Fabric's data transmission queues to prevent bad data propagation to non-volatile storage.

In the event of a system fatal error, the CPU supports system recovery through a warm reset and the option for an out-of-band MCA information collection in case of a system hang during the warm reset. If warm reset fails, the CPU initiates a cold restart, also supporting the triggering of a system hang before the cold restart.

To facilitate debugging, the processor logs all fatal error events in the NBIO MCA bank. AMD recommends platforms to shield fatal errors in the NBIO MCA, as fatal errors will still be recorded in the original block's MCA (or MCAX).

Purpose

Improve system reliability.

Configuration

The Reset after Sync-Flood menu controls whether a hot reset is executed when a fatal error occurs. When this menu is set to TRUE, a fatal error triggers a hot reset. When set to FALSE, a fatal error will not trigger a hot reset.

Remarks

N/A

 

Machine Overflow Recovery

Feature name

Machine Overflow Recovery

Description

Machine Overflow Recovery allows system recovery in overflow situations. When MCi_STATUS[Overflow] records a non-fatal error, Machine Overflow Recovery is supported. If a system-critical error occurs, Machine Overflow Recovery is not allowed. When Machine Overflow Recovery is supported, software records all system critical situations via MCA_STATUS[PCC]. If Machine Overflow Recovery is not supported, unrecoverable errors recorded in MCA_STATUS[Overflow] will not be logged as recovery errors.

Purpose

Improve system reliability.

Configuration

Automatically takes effect and cannot be disabled.

Remarks

N/A

 

CPU Watchdog Timers

Feature name

CPU Watchdog Timers

Description

The CPU watchdog timer (WDT) is used to detect situations where the x86 kernel cannot make forward progress and to recover from them. It is a configurable counter that is reset upon completion of each instruction operation. If no operation is completed within a specified time, a system fatal error event is generated.

Purpose

Use a watchdog timer to track execution progress and detect when the processor is unable to continue forward execution.

Configuration

Enabled by default.

Remarks

N/A

 

Decoding Last I/O Addresses Table

Feature name

Decoding Last I/O Addresses Table

Description

When a fatal error occurs, diagnostic software records the Last I/O address to DF::OrigWdtAddrLogLo and DF::OrigWdtAddrLogHi. The address range 0 to FFFFFFFC_FFFFFFFF corresponds to DRAM or MMIO, and the specific physical DRAM location can be determined further through the OS's memmap. The range FFFFF_10000000 to FFFFF_1FFFFFF corresponds to a PCIe device, where bits 27-20 represent the bus, bits 19-15 represent the device, and bits 14-12 represent the function, allowing identification of the specific PCIe device location. The range FFFFFFD_FC000000 to FFFFFFD_FC00FFFF corresponds to PCIe I/O.

Purpose

Identify whether the fatal error is caused by an external component through the decoded address.

Configuration

Enabled by default and collected through ADDC.

Remarks

N/A

 

DRAM Thermal Throttling

Feature name

DRAM Thermal Throttling

Description

When approaching DIMM temperature limits, the processor supports thermal throttling. It increases refresh rates to maintain data integrity and imposes bandwidth limits on the command bus. The processor also supports firmware-driven bandwidth limits, enabling a response to platform-detected events using APML.

Purpose

Increase availability by avoiding downtime situations due to exceeding temperature limits.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

CPU Thermal Throttling

Feature name

CPU Thermal Throttling

Description

The processor supports thermal throttling when approaching temperature limits. Increase availability by avoiding downtime situations due to exceeding temperature limits.

Purpose

Improve system availability and maintainability.

Configuration

Enabled by default and partially configurable.

Remarks

N/A

 

PCIe speed reduction and bandwidth reduction detection

Feature name

PCIe speed reduction and bandwidth reduction detection

Description

The processor supports thermal throttling when approaching temperature limits. Increase availability by avoiding downtime situations due to exceeding temperature limits.

Purpose

Improve system availability and maintainability.

Configuration

Enabled by default and partially configurable.

Remarks

N/A

 

PCIe Downstream Port Containment and Error Disconnect Recovery

Feature name

PCIe Downstream Port Containment and Error Disconnect Recovery

Description

The processor supports the DPC functionality described in the PCIe specification, including RP PIO extension and DL_Active ERR_COR signal. Platform firmware and operating systems coordinate using downstream port containment (DPC) and Surprise Down functionality of the processor to support PCIe link error recovery.

Purpose

Avoid propagation of potentially bad data with error isolation and recovery on PCIe links

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

PCIe Orderly and Surprise Hotplug

Feature name

PCIe Orderly and Surprise Hotplug

Description

Supports hot-swapping of PCIe devices, allowing users to replace faulty devices and restore the server to normal operation when the system is not powered on.

Purpose

Improve system availability.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

DRAM Memory Tester

Feature name

DRAM Memory Tester

Description

The processor offers the capability to test installed memory to check for damaged DIMMs. During the reset sequence, the memory tester writes and reads data to a set of addresses on each installed channel and DIMM in the system. If an uncorrectable ECC error is detected while accessing a DIMM, it is marked as damaged. When a DIMM is marked as damaged, the processor disables it and performs a system hot reset to reinitialize the memory mapping without the DIMM in the system.

Purpose

The memory tester's purpose is not to exhaustively search for all DIMM errors, but to test for completely damaged DIMMs.

Configuration

Enabled by default.

Remarks

N/A

 

Automatic Boot-time DIMM MapOut

Feature name

Automatic Boot-time DIMM MapOut

Description

If one of the two memory tests, PMU training and Agesa Memory Test, fails, the failed memory will be moved out of the Map and disabled.

Purpose

Improve system memory reliability and stability.

Configuration

Enabled by default.

Remarks

N/A

 

Automatic Boot-time Core Disable

Feature name

Automatic Boot-time Core Disable

Description

This processor implements a feature where if the core fails the Built-In Self-Test (BIST), it can be ignored from the active configuration. If a core or cache fails the BIST, the processor will report the faulty core complex and attempt to boot with the minimum number of core complexes as specified in the processor programming reference.

Purpose

Automatically disable cores that have not passed the Built-In Self-Test.

Configuration

Disabled by default.

Remarks

N/A

 


Serviceability

In-band Error Reporting

Machine Check Architecture

Feature name

Legacy x86 Machine Check Architecture (MCA)

Description

The processor implements the x86 Machine Check Architecture (MCA). MCA defines a way to record processor and system hardware errors and report them to system software, allowing system software to play a role in the recovery and diagnosis of hardware errors.

Purpose

Ensure component-level reliability.

Configuration

Automatically takes effect and cannot be disabled.

Remarks

N/A

 

Feature name

AMD Machine CheckArchitecture Extensions(MCAX)

Description

MCAX, an extension of the Machine Check Architecture, provides a richer feature set than the traditional x86 Machine Check Architecture. The extended features include:

·     Expansion of MCA banks: Supports extending the number of MCA banks supported by AMD processors to support comprehensive error logging for many blocks in the processor.

·     Expansion of MCA banks size: Enhanced error handling, improved diagnostics, and finer configuration by recording extended information in MCA Banks.

·     Single-kernel ownership of MCA banks: Each MCA register set is only visible to one core, and no software synchronization is required when accessing MCA set registers.

Purpose

Enhance BIOS control over faults.

Configuration

Automatically takes effect and cannot be disabled.

Remarks

N/A

 

Feature name

Machine Check Architecture (MCA) Thresholding

Description

The error threshold is used to count the number of errors, and the processor implements the x86 MCA threshold through the MCA_MISCx register. Supports setting error threshold via saturation counter, SMI interrupt will be sent on overflow.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

Platform First Error Handling

Feature name

Platform First Error Handling (PFEH)

Description

The processor implements platform-first error handling, enabling all errors logged in the MCA to be reported first to the platform firmware rather than to the operating system or hypervisor. This feature allows the platform firmware to take platform-specific action (for example, repair or log) for each error before notifying the operating system or hypervisor. For example, platform firmware can implement predictive failure analysis to reduce service costs or future downtime.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

PCIe® Advanced Error Reporting (AER)

Feature name

PCIe AER

Description

Each PCIe root port on the processor supports AER, enabling advanced error handling, diagnostics and recovery capabilities for PCIe devices. This feature works on physical root ports that connect external devices and internal root ports that connect internal PCIe devices. The processor supports OS-first and firmware-first reporting of PCIe AER errors.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

CXL™ Error Handling

Feature name

CXL Error Handling

Description

The processor supports CXL protocol version 1.1. For CXL support, the processor provides up to 4 RCs per Package, and each RC supports up to 4 CXL ports. That is, each Package supports up to 16 CXL memory devices. CXL error handling mainly includes device error handling and protocol layer error handling.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

CXL™ Protocol Error Reporting

Feature name

CXL Protocol Error Reporting

Description

CXL protocol fault reporting is completed by the PCIe AER register and the CXL Ras capability structure. Therefore, the strategy used to control CXL protocol fault reporting will be affected by the PCIe AER control strategy. If CXL protocol fault reporting is set to firmware priority, then CXL protocol fault reporting follows the firmware priority principle. If CXL protocol fault reporting is set to PCIe AER reporting, then CXL protocol error reporting follows the PCIe AER fault reporting policy.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

CXL™ Component Error Reporting

Feature name

CXL Component Error Handling

Description

CXL component faults are reported by CXL devices that support the Mailbox interface defined by the CXL Memory device register. For firmware-first fault reporting, each CXL device component that supports CXL component failures provides GHES/CPER structures for fault reporting in the system.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

PMIC Error Handling

Feature name

PMIC Error Handling

Description

On DDR5, the PMIC provides voltage to the DDR core/IO and VPP. According to the JEDEC PMIC specification, PMIC can report faults. The ability of PMIC fault recovery depends on different PMIC manufacturer strategies. In the memory initialization phase, the PMIC register is read to collect fault information. In the firmware execution phase, PMIC faults are reported to BMC and recorded in the BERT table.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default and can be configured from the BIOS.

Remarks

N/A

 

Machine Check Alert

Feature name

Machine Check Alert

Description

When an MCA error occurs, the CPU sends an alert to APML. The CPU will record this alert in the SB-RMI RasStatus register, which can be accessed through APML.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default.

Remarks

N/A

 

Fatal Error Alert

Feature name

Fatal Error Alert

Description

When a fatal error occurs, the CPU sends an alert to APML. The CPU will record this alert in the SB-RMI RasStatus register, which can be accessed through APML. At the same time, when a fatal error occurs, the CPU also supports disabling the hot reset action, which can be used to enable MCA collection after a fatal error occurs.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default.

Remarks

N/A

 

Thermal Throttle Alert

Feature name

Thermal Throttle Alert

Description

The processor supports thermal throttling event alerts. SB-TSI has high/low thresholds that trigger the ALERT_L signal to the BMC.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default.

Remarks

N/A

 

MCA Mailbox

Feature name

MCA Mailbox

Description

The processor supports two BMC Mailbox commands for querying and retrieving any valid MCA: BMC_RAS_MCA_VALIDITY_CHECK and BMC_RAS_MCA_MSR_DUMP. If warm restart response to fatal errors is disabled, these Mailbox commands can be used at runtime or after a fatal error is generated.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default.

Remarks

N/A

 

Legacy SB-RMI MCA Interface

Feature name

Legacy SB-RMI MCA Interface

Description

The processor supports the Legacy SB-RMI MCA interface for polling MCA registers. Accessing MCA state through SB-RMI or MSR in-band modes does not conflict. In the event of a severe error, MCA registers cannot be accessed before a reset. As a best practice, use the MCA mailbox method to access MCA instead of this Legacy interface.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default.

Remarks

N/A

 

Reset Reason Mailbox

Feature name

Reset Reason Mailbox

Description

FCH provides a register that holds the reason for previous restarts. The S5_Reset_Status register of the FCH contains the cause of the last reset, including hardware-induced fatal errors, x86 triple-fault shutdown events, and push-button and software-triggered resets.

Purpose

Enhance BIOS fault control.

Configuration

Enabled by default.

Remarks

N/A

 

ACPI Platform Error Interface (APEI)

Feature name

APEI

Description

Supports industry-standard Advanced Configuration and Power Interface Platform Error Interface (APEI) version 6.2. AMD supports the Hardware Error Status Table (HEST), Boot Error Recording Table (BERT), and Error Injection Table (EINJ).

Purpose

Improve system availability and maintainability.

Configuration

Enabled by default and cannot be configured.

Remarks

N/A

 

APEI Boot Error Record Table (BERT)

Feature name

APEI Boot Error Record Table (BERT)

Description

All errors logged at startup create an entry in BERT and are divided into general error data entries, memory error entries, memory parity error entries, PCIe error entries and processor error entries according to the error type.

Purpose

Record startup errors and report them to the operating system.

Configuration

Enabled by default.

Remarks

N/A

 

APEI Hardware Error Status Table (HEST)

Feature name

APEI Hardware Error Status Table (HEST)

Description

Supports the ACPI standard HEST table to define hardware-related error sources and error types. Errors reported via standard MCA, MCAX support MCE error sources, CMCI sources and DMC sources (MCAX) or PFEH interfaces.

Purpose

Implementation of standardized software and hardware error interfaces.

Configuration

Enabled by default.

Remarks

N/A

 

APEI Error Injection Table (EINJ)

Feature name

APEI Error Injection Table (EINJ)

Description

AMD supports error injection for type 3, 4, and 5 errors through APEI's EINJ table. The firmware allows separate injection and triggering of each error type.

·     Type 3 error injection injects correctable DRAM ECC errors at the specified address.

·     Type 4 error injection injects uncorrectable DRAM ECC errors at the specified address, to be reported and recorded as deferred errors.

·     Type 5 error injection injects uncorrectable DRAM ECC errors at specified addresses. The triggering action of a Type 5 injection causes the operating system to read the error location and log and report a delay error, and an uncorrectable poison consumption error to be logged and reported via MCA.

Purpose

Support memory ECC error injection and verify related functions.

Configuration

Enabled by default.

Remarks

N/A

 

Out of Band Error Monitoring

Feature name

Out of Band Error Monitoring

Description

The processor supports out-of-band error monitoring through the APML interface. This can be used by a BMC or similar functional entity to monitor errors occurring on the CPU.

Purpose

Enhance system control over failures.

Configuration

Enabled by default.

Remarks

N/A

 

DRAM Corrected Error Leaky Bucket Counters

Feature name

DRAM Corrected Error Leaky Bucket Counters

Description

The processor provides a 16-bit DRAM correction error counter for each chip select, which is incremented when a corrected ECC error is detected. For DIMMs that support column multiplication, the memory controller provides a counter for each column multiplier per chip select. When the counter is saturated, an SMI or APIC-based LVT interrupt can optionally be generated.

Purpose

Improves memory reliability and reduces the likelihood of data corruption or system crashes.

Configuration

Enabled by default.

Remarks

N/A

 

Boot Status Indicators

Feature name

Boot Status Indicators

Description

Boot Status Indicators indicate the status of the boot process. If issues occur during boot, the indicators pinpoint the problem, enabling fault detection.

Purpose

Improved system availability and maintainability.

Configuration

Enabled by default.

Remarks

N/A

 

DRAM Post-Package Repair (CPU)

Feature name

DRAM Post-Package Repair (CPU)

Description

The JEDEC-defined post-package repair (PPR) feature allows a spare DRAM row to be configured to replace a faulty or failed row. The processor provides hardware and firmware support, allowing soft (reconfigurable) and hard (permanent) fixes.

Purpose

Using DDR5 DRAM PPR, DIMMs with problematic DRAM rows can be reconfigured and maintain the same level of reliability as before the problem occurred.

Configuration

Enabled by default.

Remarks

N/A

 

MCA Address Translation

Feature name

MCA Address Translation

Description

When a DRAM ECC error occurs, the Ecc error address recorded by the memory controller in the MCA register is a truncated version (relative to the normalized address). The truncated version address needs to be converted into the system physical address through a specific interface before it can be used directly by the platform firmware or operating system. Similarly, the truncated version address needs to be converted into a DRAM physical address through a specific interface before it can be directly used by the platform firmware to obtain DRAM chip select/row/bank/column and other information.

Purpose

Used to report and record DRAM Ecc errors.

Configuration

The DXE stage uses AMD_RAS_SERVICE_DXE_PROTOCOL to perform MCA Address conversion. The Rumtime stage uses AMD_RAS_SERVICE_SMM_PROTOCOL to perform MCA Address conversion.

Remarks

The platform firmware aligns the translation with the provided interface before performing any processing on the DRAM Address recorded in the MCA register.

 

MCA FruText and DIMM FRU Identification

Feature name

MCA FruText and DIMM FRU Identification

Description

The memory controller records the characteristic information of each Error in the MCA_SYND1_UMC and MCA_SYND2_UMC registers. The CPU supports filling some platform-related FruText information in the MCA error log reported to the operating system into these MCA_SYND1_UMC and MCA_SYND2_UMC registers.

MCA FruTestju encoding example:

Perr Sx:Txx:Bxx:

·     Perr—Processor Error

·     S—Socket#

·     T—Thread#

·     B—Bank#

·     x—hexadecimal number

For example, Perr S0:T0a:B1f.

Purpose

Used to quickly analyze the meaning of MCA Error.

Configuration

Set PcdAmdMcaFruTextEnable to not TRUE.

Remarks

·     For platform firmware:

¡     The platform firmware needs to implement the DIMM FRU recognition Protocol (DXE_RAS_OEM_DIMMMAP_PROTOCOL) to provide the platform-related memory map table.

¡     The platform firmware needs to populate the FruText field values into HEST and BERT tables.

·     For operating system (Linux):

¡     When MCA_CONFIG[FruTextInSynd] is 1, print MCA_SYND1 and MCA_SYND2 as ASCII characters in the MCA error log.

¡     When MCA_CONFIG[FruTextInSynd] is 0, print MCA_SYND1 and MCA_SYND2 as hexadecimal numbers in the MCA error log.

 


Platform RAS features

ABL RAS

Feature name

ABL RAS

Description

During the ABL stage, the processor performs a series of error detection and reporting processes, including:

·     Executing memory initialization.

·     Performing DDR training and recording training results.

·     Conducting MBIST testing and recording test results.

·     Collecting MBIST data in APOB and storing it in memory.

·     The driver retrieves MBIST results and DDR training results.

·     Transmitting error information through Error Codes.

Purpose

Implement error detection and reporting during the ABL stage.

Configuration

Enabled by default and can be configured from BIOS.

Remarks

N/A

 

Error Injection

Feature name

ABL RAS

Description

To assist with software testing and debugging, the processor supports hardware interfaces for fault injection. Supported error types for injection include:

·     On-chip ECC/parity check (NBIO, SMU/PSP/MPIO, UMC, and PCIe).

·     DRAM ECC.

·     DRAM UECC retry.

·     DRAM address/command parity check and replay.

·     On-Package Link Errors.

·     Off-Package Link Errors.

·     PCIe LCRC.

·     PCIe End-to-End CRC.

·     USB ECC.

·     SATA parity check.

Purpose

Convenient for server testing and debugging.

Configuration

Enabled by default.

Remarks

N/A

 

IPMI command classification for RAS reporting

index

Error Types

NETFUN

BMC_LUN

CMD

SubCmd

Description

Reporting Stage

1

McaErr

0x36

0x00

0x2a

0x03

Report MCAerr register

Runtime stage

POST stage

DXE, SMM

2

MemTestErr

0x36

0x00

0x2a

0x06

Report MemTestErr

POST stage

DXE

3

PcieErr

0x36

0x00

0x2a

0x04

Report PcieErr

Runtime stage

POST stage

DXE, SMM

4

SmnErr

0x36

0x00

0x2a

0x05

Report SmnErr

POST stage

DXE

5

NbioErr (Not Implemented)

 

 

 

 

 

 

6

ReportAbsentMemTestErrToBMC (Absent dimm)

0x36

0x00

0x2a

0x08

POST stage

DXE

7

PMIC Error

0x36

0x00

0x2a

0x09

POST stage

DXE

8

SATA Error (Not Implemented)

 

 

 

 

 

 

9

USB Error (Not Implemented)

 

 

 

 

 

 

10

NbioSyncFloodFromPin

0x36

0x00

0x2a

0x0a

POST stage.

DXE

11

DRAM ECC Error

0x36

0x00

0x2a

0x0b

Runtime stage.

SMM

12

McaErr threshold

0x36

0x00

0x04

0x39

Report the MCA CPU err register

Runtime stage

SMM

13

PCIeErr threshold

0x36

0x00

0x04

0x3a

Report PCIe err aer register

Runtime stage

SMM

 

SMI storm suppression

Feature name

SMI storm suppression

Description

AMD supports SMI Polling storm suppression. When the CE error reaches the threshold, the SMI interrupt is triggered to report the error. When the triggered SMI reaches the threshold within a certain period of time, an SMI storm is considered to have occurred, and the Polling Mode is entered for suppression, and SMI interrupts are no longer triggered.

Purpose

When an SMI storm occurs, suppress it to avoid system downtime.

Configuration

RAS Periodic SMI Control is enabled by default.

Remarks

N/A

 

SWITCH RAS

Feature name

SWITCH RAS

Description

SWITCH also has slot numbers and can note and report errors like ordinary PCIe devices.

Purpose

SWITCH can normally trigger RAS related mechanisms.

Configuration

Consistent with ordinary PCIe devices.

Remarks

N/A

 

  • Cloud & AI
  • InterConnect
  • Intelligent Computing
  • Intelligent Storage
  • Security
  • SMB Products
  • Intelligent Terminal Products
  • Product Support Services
  • Technical Service Solutions
All Services
  • Resource Center
  • Policy
  • Online Help
  • Technical Blogs
All Support
  • Become A Partner
  • Partner Policy & Program
  • Global Learning
  • Partner Sales Resources
  • Partner Business Management
  • Service Business
All Partners
  • Profile
  • News & Events
  • Online Exhibition Center
  • Contact Us
All About Us
新华三官网