Troubleshooting virtual technologies
IRF issues
IRF setup failure
Symptom
Several devices cannot form an IRF fabric, or a new member device cannot join an existing IRF fabric.
Common causes
The following are the common causes of this type of issue:
· When you use member devices to set up a new IRF fabric, the total number of IRF member devices exceeds the upper limit. When you add a new member device to an existing IRF fabric, the number of existing IRF member devices has reached the upper limit in that IRF fabric.
· The device configuration does not meet the IRF setup requirements.
· The IRF physical interfaces, cables, and physical topology do not meet the IRF setup requirements. As a result, the IRF links cannot come up.
Troubleshooting flow
Figure 1 shows the troubleshooting flowchart.
Figure 1 Flowchart for troubleshooting IRF setup failure
Solution
IMPORTANT: This section only covers the routine requirements for setting up an IRF fabric. For more information about the requirements for setting up an IRF fabric, see IRF configuration in the configuration guides for the product.
1. Identify whether the number of IRF member devices has reached the maximum value supported by the system.
Execute the display irf command to view the number of member devices in the current IRF fabric. If the number of IRF member devices has reached the maximum value supported by the system, you cannot add any member device to the IRF fabric.
The maximum number of member devices in an IRF fabric varies by device model. For example, an S12500-AF IRF fabric can contain a maximum of four member devices.
2. Verify that all member devices run the same version of software.
Execute the display version command to display the current software version on each device. Only devices running the same software version can form an IRF fabric.
Typically, the IRF auto-update feature (enabled by default) can automatically synchronize the software version of a member device with the software version of the master device. However, the synchronization might fail when the gap between the software versions is large. In this case, you must manually upgrade the software of that member device.
If the member device has two MPUs, you must upgrade software for both the MPUs to ensure software consistency across them.
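The version comparison in this step can be scripted after the display version output has been captured from each device. The following Python sketch is illustrative only. It assumes that the banner line has the same form as in the display version example in this guide (Version 7.1.070, Alpha 704228). The function names and the device labels are hypothetical.

```python
import re
from collections import defaultdict

# Matches a banner line such as:
#   H3C Comware Software, Version 7.1.070, Alpha 704228
# The exact banner layout is an assumption and can differ by model or release.
VERSION_RE = re.compile(r"Version\s+(\S+),\s*(.+)")


def parse_software_version(display_version_output):
    """Return (version, release) from captured display version text, or None."""
    for line in display_version_output.splitlines():
        match = VERSION_RE.search(line)
        if match:
            return match.group(1), match.group(2).strip()
    return None


def group_members_by_version(outputs):
    """Group device labels by parsed software version.

    outputs maps a device label (hypothetical) to its display version text.
    More than one key in the result means the software versions must be
    aligned before the devices can form an IRF fabric.
    """
    groups = defaultdict(list)
    for name, text in outputs.items():
        groups[parse_software_version(text)].append(name)
    return dict(groups)
```

If the result contains a single key, all captured devices report the same version string. A None key means that the banner line was not recognized and the output must be checked manually.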
3. Verify that the IRF configuration on each member device meets the IRF setup requirements.
a. Verify that all member devices are operating in IRF mode.
Some products are shipped in IRF mode and do not support mode conversion. Some products are shipped in standalone mode and support mode conversion. If a device supports the display irf link or display irf topology command, the device is operating in IRF mode. If a device does not support either of the commands, the device is operating in standalone mode. To enable IRF mode for the device, execute the chassis convert mode irf command in system view.
<Sysname> display irf ?
> Redirect it to a file
>> Redirect it to a file in append mode
configuration IRF configuration that will be valid after reboot
link Display link status
topology Topology information
| Matching output
<cr>
b. Verify that the member ID of each member device is unique across the IRF fabric.
Execute the display irf command to display the member IDs of the member devices in the IRF fabric. Each member device in the IRF fabric must use a unique member ID. Devices that use the same member ID cannot establish an IRF fabric or join the same IRF fabric. The default member ID for a device is 1. In standalone mode, you can change the IRF member ID of a device by using the irf member command. In IRF mode, you can change the IRF member ID of a device by using the irf member renumber command. For the new member ID to take effect, you must save the configuration and reboot the device.
c. Verify that each member device is shipped with a unique bridge MAC address.
Member devices shipped with the same bridge MAC address cannot join the same IRF fabric. Typically, each device is shipped with a unique bridge MAC address across the network. If IRF setup fails and the Failed to stack because of the same bridge MAC addresses message is generated, two devices are shipped with the same bridge MAC address. In this case, use the irf mac-address command to change the bridge MAC address on one of the devices. (Support for the irf mac-address command depends on the device model.)
d. Verify that all member devices in the same IRF fabric use the same IRF domain ID.
The IRF domain ID does not affect IRF fabric setup and merge, but it affects multi-active detection (MAD). To ensure that MAD can operate correctly, make sure all member devices in the same IRF fabric use the same IRF domain ID. By default, the IRF domain ID is 0. To obtain the IRF domain ID of a device, execute the display irf command on that device and check the value in the Domain ID field of the command output. If the IRF domain ID of a device is different from that of the other devices, execute the irf domain command to change the IRF domain ID on the device.
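The uniqueness and consistency checks in step 3 can be expressed as a short script. The following Python sketch assumes that the member ID, bridge MAC address, and IRF domain ID of each device have already been read from the CLI and entered by hand. The dictionary keys and the function name are hypothetical and are not part of any H3C tool.

```python
from collections import Counter


def find_irf_config_conflicts(devices):
    """Report settings that violate the checks in step 3.

    devices is a list of dicts with the hypothetical keys "name",
    "member_id", "bridge_mac", and "domain_id".
    """
    problems = []
    # Member IDs and bridge MAC addresses must be unique across the fabric.
    for key, label in (("member_id", "member ID"), ("bridge_mac", "bridge MAC address")):
        counts = Counter(device[key] for device in devices)
        for value, count in sorted(counts.items()):
            if count > 1:
                names = [d["name"] for d in devices if d[key] == value]
                problems.append(f"duplicate {label} {value}: {', '.join(names)}")
    # The IRF domain ID must be the same on all members for MAD to work.
    domain_ids = sorted({device["domain_id"] for device in devices})
    if len(domain_ids) > 1:
        problems.append(f"IRF domain IDs differ: {domain_ids}")
    return problems
```

An empty list means that none of the three checks found a conflict in the supplied data.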
4. Verify that the IRF ports are in up state.
An IRF port is a logical interface that connects IRF member devices. To use an IRF port, you must bind a minimum of one physical interface to it. To obtain the status of IRF ports, execute the display irf topology command and check the value in the Link field of the command output.
<Sysname> display irf topology
                              Topology Info
 -------------------------------------------------------------------------
               IRF-Port1                IRF-Port2
 MemberID    Link       neighbor      Link       neighbor    Belong To
 2           DIS        ---           UP         1           5e40-08d9-0104
 1           UP         2             DIS        ---         5e40-08d9-0104
¡ If the value of the Link field is UP for an IRF port on a member device, the IRF port is correctly connected and no action is required.
¡ If the value of the Link field is DIS for an IRF port on a member device, no IRF physical interfaces have been bound to the IRF port. If binding IRF physical interfaces to the IRF port is required, execute the port group interface command in IRF port view to bind IRF physical interfaces to the IRF port.
¡ If the value of the Link field is DOWN for an IRF port on a member device, execute the display irf link command to examine whether the IRF physical interfaces bound to the IRF port are in UP state.
- If a minimum of one IRF physical interface is up when the IRF port is down, the configuration of the IRF port might not be activated. To activate the IRF port configuration, execute the irf-port-configuration active command in system view.
- If no IRF physical interfaces are in UP state, proceed to step 5 to troubleshoot the IRF physical interface issue.
¡ If the value of the Link field is TIMEOUT for an IRF port on a member device, the IRF hello packets have timed out and the IRF link has communication issues. Perform the following tasks to locate the timeout issue of IRF packets:
- Identify whether the IRF packet exchange failure is caused by an anomaly of the neighboring IRF port. For this purpose, log in to the neighboring device at the other end of the IRF link, execute the display irf topology and display irf link commands on the neighboring device, and then locate the issue based on the command output.
- Verify that no network loops exist on the IRF fabric, as they lead to packet loss. To identify whether a network loop exists, execute the display counters rate inbound interface command to display the packet rate statistics of the IRF physical interfaces and examine whether a packet storm has occurred on the IRF link. If a packet storm exists, check for a physical loop and examine whether the VLAN and STP settings are correct. If a physical loop exists or the settings are incorrect, remove the loop or correct the settings to resolve the packet storm issue.
- Execute the display device command to examine whether the switching fabric modules are operating correctly. If not, first troubleshoot the issue with the switching fabric module.
¡ If the value of the Link field is ISOLATE for an IRF port on a member device, the member device is isolated. In this case, execute the display logbuffer | include STM stackability check command, and then proceed according to the command output.
- If the command output includes the STM stackability check: Product series is inconsistency message, the model of the member device does not meet the IRF setup requirements. In this case, proceed to step 7.
- If the command output includes the STM stackability check: Product xxx is inconsistency message, where xxx might represent the system operating mode or other settings that require consistency across member devices, the current system parameter configuration does not meet the IRF setup requirements. In this case, proceed to step 8.
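The Link field triage described in step 4 can be sketched in Python as follows. The sketch assumes that the captured display irf topology output has the six-column row layout shown in the example in this step. The function names and the short action strings are hypothetical summaries of the guidance above.

```python
# Next action for each Link value, condensed from step 4.
LINK_ACTIONS = {
    "UP": "no action required",
    "DIS": "bind IRF physical interfaces to the IRF port if the port is needed",
    "DOWN": "run display irf link and check the bound physical interfaces",
    "TIMEOUT": "check the neighbor, network loops, and switching fabric modules",
    "ISOLATE": "check the log buffer for STM stackability check messages",
}


def parse_irf_topology(output):
    """Yield (member_id, irf_port, link_state) for each data row."""
    for line in output.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a numeric member ID.
        if len(fields) == 6 and fields[0].isdigit():
            yield int(fields[0]), 1, fields[1]
            yield int(fields[0]), 2, fields[3]


def triage_irf_ports(output):
    """Return a list of (member_id, irf_port, link_state, action) tuples."""
    return [
        (member, port, state, LINK_ACTIONS.get(state, "unrecognized state"))
        for member, port, state in parse_irf_topology(output)
    ]
```

Header and separator lines are skipped because they do not start with a numeric member ID. Any Link value that is not listed in this step is reported as unrecognized instead of being guessed.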
5. Verify that the IRF physical interfaces are in up state.
Execute the display irf link command to check the state of IRF physical interfaces.
¡ If the value of the Interface field is disable for an IRF port, no IRF physical interfaces have been bound to the IRF port.
¡ If the value of the Interface field for an IRF port is one or multiple physical interface names, continue to check the Status field. The value and meaning of the Status field are as follows:
- UP: The IRF physical link is up. No action is required.
- DOWN: The IRF physical link is down. Verify that the transceiver module and the fiber or cable of the IRF physical interface are operating correctly. You must use a physical interface that meets the product requirements as an IRF physical interface, and connect it with a connection medium that meets the product requirements. If the transceiver module and the fiber or cable are operating correctly, proceed to step 6.
- ADM: The IRF physical interface has been shut down by using the shutdown command and is administratively down. To bring up the IRF physical interface, execute the undo shutdown command.
- ABSENT: The IRF physical interface does not exist. Insert the card or expansion interface module that hosts the interface.
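The following Python sketch combines the DOWN case of step 4 with the Status values of step 5. It assumes that the Status value of each physical interface bound to the IRF port has been read from the display irf link output and entered manually. The function name and the action strings are hypothetical.

```python
def diagnose_down_irf_port(interface_statuses):
    """Suggest next actions for an IRF port whose Link field is DOWN.

    interface_statuses maps a physical interface name to its Status value
    (UP, DOWN, ADM, or ABSENT) as read from display irf link.
    """
    if any(status == "UP" for status in interface_statuses.values()):
        # A physical link is up but the IRF port is down: the IRF port
        # configuration might not be activated.
        return ["execute irf-port-configuration active in system view"]
    per_status = {
        "DOWN": "check the transceiver module and fiber or cable, then the physical connections",
        "ADM": "execute undo shutdown on the interface",
        "ABSENT": "insert the card or expansion interface module that hosts the interface",
    }
    return [
        f"{name}: {per_status.get(status, 'unrecognized status')}"
        for name, status in interface_statuses.items()
    ]
```

The function returns one action for the whole port when any bound interface is up, and one action per interface otherwise.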
6. Verify that the IRF physical connections meet the IRF connection requirements.
Perform the following operations to locate an IRF physical connection issue:
a. On each member device, execute the display irf configuration command to view the binding relationship between IRF ports and IRF physical interfaces. Verify that the IRF physical interfaces bound to IRF ports are consistent with those on the IRF physical connections. If not, reconfigure the IRF port bindings or reconnect physical interfaces.
b. Verify that the IRF physical interfaces are correctly connected. Make sure the IRF physical interfaces of IRF-port 1 on one member device are connected to the IRF physical interfaces of IRF-port 2 on another member device. If the IRF fabric contains only two member devices, you must connect them in a daisy-chain topology rather than a ring topology.
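The cabling rules in step 6 can be checked against a cabling plan before the devices are connected. The following Python sketch is a minimal illustration. The link representation and the function name are hypothetical, and the data must be built manually from the cabling plan or from the display irf configuration output on each member device.

```python
def check_irf_cabling(links):
    """Check IRF-port-level connections against the rules in step 6.

    links is a list of ((member_a, port_a), (member_b, port_b)) tuples,
    where each port is 1 or 2. This representation is hypothetical.
    """
    problems = []
    for (member_a, port_a), (member_b, port_b) in links:
        if member_a == member_b:
            problems.append(f"member {member_a} is connected to itself")
        elif port_a == port_b:
            problems.append(
                f"IRF-port {port_a} on member {member_a} is connected to "
                f"IRF-port {port_b} on member {member_b}; connect IRF-port 1 to IRF-port 2"
            )
    members = {member for link in links for member, _ in link}
    # Several physical links can back one IRF connection, so count distinct
    # IRF port pairs rather than raw entries.
    distinct = {frozenset(link) for link in links}
    if len(members) == 2 and len(distinct) > 1:
        problems.append("two-member fabric is cabled as a ring; use a daisy-chain topology")
    return problems
```

An empty list means that the supplied links follow the IRF-port 1 to IRF-port 2 rule and that a two-member fabric is not cabled as a ring.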
7. Verify that the hardware of the member devices meets the IRF setup requirements.
You must use hardware that meets the IRF setup requirements to set up an IRF fabric. For example, the device model, MPUs, interface modules, and IRF physical interfaces must meet the IRF setup requirements. You can perform the following tasks to determine whether the device hardware meets the IRF setup requirements:
# Execute the display version command to check the device model.
<Sysname> display version
H3C Comware Software, Version 7.1.070, Alpha 704228
Copyright (c) 2004-2021 New H3C Technologies Co., Ltd. All rights reserved.
H3C S12508X-AF uptime is 0 weeks, 0 days, 2 hours, 31 minutes
Last reboot reason : Cold reboot
...
# Execute the display device command to check the models of the MPUs and interface modules.
<Sysname> display device
Slot Type             State    Subslot  Soft Ver             Patch Ver
1/0  LSXM1SUPB1       Master   0        S12508X-AF-704228    None
1/1  NONE             Absent   0        NONE                 None
1/2  NONE             Absent   0        NONE                 None
1/3  LSXM1CGQ18QGHB1  Normal   0        S12508X-AF-704228    None
...
# Execute the display interface command to check the rate and type of each IRF physical interface.
<Sysname> display interface ten-gigabitethernet 0/0/6
Ten-GigabitEthernet0/0/6
Current state: UP
IP packet frame type: Ethernet II, hardware address: 4077-a9ee-ce85
Description: Ten-GigabitEthernet0/0/6 Interface
Bandwidth: 10000000 kbps
Loopback is not set
Media type is optical fiber, port is 10G_BASE_SR_SFP
10Gbps-speed mode, full-duplex mode
Link speed type is autonegotiation, link duplex type is autonegotiation
...
8. Verify that the system parameter settings meet the IRF setup requirements.
To set up an IRF fabric, all member devices must use the same system parameter settings, including the same system operating mode, VXLAN hardware resource mode, route hardware resource mode, and maximum number of ECMP routes. (The restrictions vary by device model.)
¡ To display the system operating mode on a device, use the display system-working-mode command. To change the system operating mode of the device, use the system-working-mode command. For the mode change to take effect, you must save the configuration and reboot the device.
¡ To display the hardware resource modes on a device, use the display hardware-resource command. To change the VXLAN and route hardware resource modes of the device, use the hardware-resource vxlan and hardware-resource routing-mode commands, respectively. For the mode changes to take effect, you must save the configuration and reboot the device.
¡ To display the maximum number of IPv4 ECMP routes and the maximum number of IPv6 ECMP routes supported by the system, use the display max-ecmp-num and display ipv6 max-ecmp-num commands, respectively. To change the maximum number of IPv4 ECMP routes and the maximum number of IPv6 ECMP routes, use the max-ecmp-num and ipv6 max-ecmp-num commands, respectively. For the changes to take effect, you must save the configuration and reboot the device.
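The consistency requirement in step 8 can be checked by comparing the values collected from each member device. The following Python sketch assumes that the settings have been read from the display commands listed above and entered manually. The function name and the parameter labels are hypothetical and are chosen by the operator.

```python
from collections import defaultdict


def find_parameter_mismatches(settings_by_member):
    """Return {parameter: {value: [members]}} for parameters that differ.

    settings_by_member maps a member label to a dict of parameter labels
    and values. The labels are free-form and hypothetical.
    """
    by_parameter = defaultdict(lambda: defaultdict(list))
    for member, settings in settings_by_member.items():
        for parameter, value in settings.items():
            by_parameter[parameter][value].append(member)
    # Keep only the parameters that have more than one distinct value.
    return {
        parameter: dict(values)
        for parameter, values in by_parameter.items()
        if len(values) > 1
    }
```

Each entry in the result names a parameter whose value is not identical on all members, together with the members that use each value.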
9. If the issue persists, collect the following information and contact Technical Support:
¡ Results of each step.
¡ The configuration file, log messages, and alarm messages.
Related alarm and log messages
Alarm messages
Module name: HH3C-STACK-MIB
· hh3cStackPhysicalIntfLinkDown(1.3.6.1.4.1.25506.2.91.6.0.8)
· hh3cStackPhysicalIntfRxTimeout(1.3.6.1.4.1.25506.2.91.6.0.9)
Log messages
· STM/3/STM_LINK_DOWN
· STM/2/STM_LINK_TIMEOUT
· STM/6/STM_LINK_UP
· STM/4/STM_SAMEMAC
· STM/3/STM_SOMER_CHECK
Unexpected reboot of an IRF member device
Symptom
The master device or a subordinate device in an IRF fabric reboots unexpectedly. As a result, the IRF fabric splits.
Common causes
The following are the common causes of this type of issue:
· The subordinate device automatically reboots to load startup software images from the master device.
· IRF merge causes the subordinate device to reboot.
· A software or hardware fault causes the device to reboot unexpectedly in an attempt to fix the fault.
Troubleshooting flow
Figure 2 shows the troubleshooting flowchart.
Figure 2 Flowchart for troubleshooting unexpected reboot of an IRF member device
Solution
1. Identify whether the rebooted device is a subordinate device.
¡ If the device is a subordinate device, proceed to step 2.
¡ If the device is the master device, proceed to step 4.
2. Identify whether the reboot is caused by the software auto-update feature.
¡ If the reboot is caused by the software auto-update feature, no action is required.
¡ If the reboot is not caused by the software auto-update feature, proceed to step 3.
To identify whether the reboot of the subordinate device is caused by the software auto-update feature, execute the display system internal irf msg command in probe view. If the command output includes the Version is different, and the sender CPU MAC is xxxx-xxxx-xxxx (chassis xx slot xx). message, the reboot of the subordinate device with the CPU MAC of xxxx-xxxx-xxxx is caused by the software auto-update feature.
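When the display system internal irf msg output is long, the message can be located with a pattern match. The following Python sketch assumes that the message follows the template quoted in this step, with a numeric chassis and slot. The function name is hypothetical.

```python
import re

# Template from step 2:
#   Version is different, and the sender CPU MAC is xxxx-xxxx-xxxx (chassis xx slot xx).
AUTO_UPDATE_RE = re.compile(
    r"Version is different, and the sender CPU MAC is "
    r"([0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4})\s*"
    r"\(chassis\s+(\d+)\s+slot\s+(\d+)\)"
)


def find_auto_update_senders(msg_output):
    """Return (cpu_mac, chassis, slot) for every matching message."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in AUTO_UPDATE_RE.finditer(msg_output)
    ]
```

An empty result means that no matching message was found in the captured text, in which case proceed to step 3.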
3. Identify whether the reboot is caused by an IRF merge.
¡ If the reboot is caused by an IRF merge, locate the causes of the IRF split and merge, and resolve them to prevent the IRF fabric from splitting and merging again.
¡ If the reboot is not caused by an IRF merge, proceed to step 4.
To identify whether the reboot of the subordinate device is caused by an IRF merge:
¡ After the device reboots, execute the display kernel reboot command on the IRF fabric to obtain the reboot reason. If the value of the Reason field is 0x7, the device rebooted because of an IRF merge. The value of the Slot field is the number of the slot that triggered the reboot, and the value of the Target Slot field is the number of the slot that was rebooted.
<Sysname> display kernel reboot 1
--------------------- Reboot record 1 ---------------------
Recorded at : 2021-12-06 00:10:05.440616
Occurred at : 2021-12-06 00:10:05.440616
Reason : 0x7
Thread : STM_Main (TID: 232)
Context : thread context
Slot : 1
Target Slot : 2
Cpu : 0
VCPU ID : 2
Kernel module info : module name (system) module address (0xffffffffc0074000)
module name (addon) module address (0xffffffffc0008000)
¡ Execute the display system internal irf msg | include reboot command in probe view on the IRF fabric. If the master device has sent a reboot message, the reboot of the subordinate device is caused by an IRF merge.
19> Send reboot pkt, src_addr 5e40-08d9-0104 (chassis 1 slot 1), at 2022/1/5 15:42:48:386
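The Reason field check on a reboot record can be sketched in Python as follows. The sketch assumes that the record uses the Key : value layout shown in the display kernel reboot example in this step. The function names are hypothetical.

```python
IRF_MERGE_REASON = 0x7  # Reason value that indicates an IRF merge


def parse_reboot_record(output):
    """Parse the 'Key : value' lines of one reboot record into a dict."""
    record = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            # Keep the first occurrence of a key.
            record.setdefault(key.strip(), value.strip())
    return record


def is_irf_merge_reboot(record):
    """Return True if the Reason field of the record is 0x7."""
    reason = record.get("Reason")
    try:
        return reason is not None and int(reason, 16) == IRF_MERGE_REASON
    except ValueError:
        return False
```

The Slot and Target Slot entries of the parsed record identify the slot that triggered the reboot and the slot that was rebooted.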
4. Examine whether the reboot is caused by a software or hardware fault.
Execute the display version command, check the Reboot Cause field for the reboot cause, and handle the reboot issue according to the reboot cause as shown in Table 1.
<Sysname> display version
...
Reboot Cause : ColdReboot
[SubSlot 0] 24GE+4SFP Plus+POE
Table 1 Device reboot causes and recommended actions
Value for the Reboot Cause field | Reboot cause description | Recommended actions |
---|---|---|
AutoUpdateReboot | The reboot was caused by an automatic software upgrade. | No action is required. |
BootwareBackupReboot | Bootware backup area reboot. | Collect log messages and diagnostic messages, and then contact Technical Support for help. |
ColdReboot | The reboot was caused by a power cycle. | Check the power supply environment of the device to ensure that the power supply module can provide power correctly to the device. |
CryptographicModuleSelftestsFailedReboot | The reboot was caused by an algorithm library self-test failure. | Upgrade the software version as soon as possible. |
CryptotestFailReboot | The reboot was caused by a cryptographic algorithm library self-check failure. | Upgrade the software version as soon as possible. |
DeadLoopReboot | The reboot was caused by a kernel thread dead loop. | Collect log messages, diagnostic messages, and the command output from the display kernel deadloop 20 verbose command executed for the reboot slot, and then contact Technical Support for help. |
DEVHandShakeReboot | The reboot was caused by a device management handshake failure. | Execute the display device command to identify whether the active MPU is in Normal state. If the state is not Normal, the MPU might have failed. You must resolve the MPU issue first. |
GoldMonReboot | The Generic OnLine Diagnostics (GOLD) module detected an exception. | Perform the following operations to locate the reboot cause: 1. Execute the display diagnostic content command, check the Correct-action field, and confirm that the corrective action is reboot. Then, obtain the time when the device was rebooted and troubleshoot issues that occurred around that time. 2. Execute the display diagnostic event-log command to display GOLD log entries. 3. Locate the reboot cause based on the command output and resolve the issue. |
IRFMergeReboot | The reboot was caused by an IRF merge. | An IRF link failure can cause an IRF split. Once the IRF link is recovered, the IRF fabric will automatically merge. To prevent the same issue from causing an IRF split and merge again, locate and resolve the issue. |
KernelAbnormalReboot | A CPU, host memory, or software issue led to a system kernel error. | Collect log messages, diagnostic messages, and the command output from the display kernel exception 10 verbose and display kernel reboot 20 verbose commands, and then contact Technical Support for help. |
KeyReboot | The RESET key was pressed. | Avoid accidental operations. |
LicenseTimeoutReboot | The license has expired. | Install a formal license as soon as possible. |
MasterLostReboot | The master slot was rebooted while the current slot was performing a bulk backup operation. | Collect log messages and diagnostic messages, and then contact Technical Support for help. |
MemoryexhaustReboot | The amount of free memory is lower than the threshold value. | Identify the cause of high memory usage and resolve the high memory usage fault accordingly. For example, too many ACL entries can cause high memory usage. |
PdtReboot | The reboot was required by the driver. | Collect log messages and diagnostic messages, and then contact Technical Support for help. |
SelfReboot | The current slot was reset. | Collect log messages and diagnostic messages, and then contact Technical Support for help. |
StandbyCannotUpdateReboot | The standby MPU cannot be upgraded to the active MPU. | Collect log messages and diagnostic messages, and then contact Technical Support for help. |
StandbySwitchReboot | The original active MPU was rebooted after an active/standby switchover. | Identify the cause of the active/standby switchover and resolve the fault that causes the active/standby switchover to prevent another unexpected active/standby switchover. For example, software upgrade can cause an active/standby switchover. |
UserReboot | The reboot was caused by a manual operation through the CLI, the network manager, or the Web interface. | No action is required. |
WarmReboot | The reboot might be caused by various reasons, for example, poor contact of board pins. | Collect log messages and diagnostic messages, and then contact Technical Support for help. |
WatchDogReboot | The watchdog detected a system fault, for example, a CPU, memory, software, or hardware fault. | Use the display hardware-failure-detection command to locate the cause of the fault based on the command output, and troubleshoot the fault. |
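The lookup in Table 1 can be applied to captured display version output with a short script. The following Python sketch is illustrative only. It assumes the Reboot Cause : value layout shown in the example in this step and reads only the first Reboot Cause field it finds. The three categories and the function name are hypothetical groupings of the recommended actions in Table 1.

```python
import re

# Causes for which Table 1 states that no action is required.
NO_ACTION = {"AutoUpdateReboot", "UserReboot"}
# Causes for which Table 1 recommends collecting information and
# contacting Technical Support.
CONTACT_SUPPORT = {
    "BootwareBackupReboot", "DeadLoopReboot", "KernelAbnormalReboot",
    "MasterLostReboot", "PdtReboot", "SelfReboot",
    "StandbyCannotUpdateReboot", "WarmReboot",
}

REBOOT_CAUSE_RE = re.compile(r"Reboot Cause\s*:\s*(\S+)")


def classify_reboot_cause(display_version_output):
    """Return (cause, category) for the first Reboot Cause field found."""
    match = REBOOT_CAUSE_RE.search(display_version_output)
    if match is None:
        return None, "reboot cause not found"
    cause = match.group(1)
    if cause in NO_ACTION:
        return cause, "no action required"
    if cause in CONTACT_SUPPORT:
        return cause, "collect logs and diagnostics, then contact Technical Support"
    return cause, "follow the specific action in Table 1"
```

On a device with several slots, the output might contain more than one Reboot Cause field. In that case, split the output per slot before calling the function.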
5. If the issue persists, collect the following information and contact Technical Support:
¡ The output from the following commands. In this example, the active MPU is in slot 16, the standby MPU is in slot 17, and the standby MPU rebooted:
- Execute the following commands in any view:
display version
display device
display diagnostic-information
display kernel deadloop 20 verbose slot 16
display kernel exception 10 verbose slot 16
display kernel reboot 20 verbose slot 16
- Execute the following commands in probe view to collect information:
local logbuffer slot 17 display
local logbuffer slot 17 display from-highmemory
display reboot last-time slot 17
display system internal version
display diag-msg start-msg slot 17
NOTE: Support for these commands depends on the device model and software version.
¡ The configuration file, log messages, and alarm messages.
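The command lists in step 5 can be generated for other slot numbers with a small helper. The following Python sketch mirrors the example, in which the kernel commands use the slot of the active MPU (16) and the probe view commands use the slot of the rebooted MPU (17). Applying the same mapping to other slot layouts is an assumption, and the function name is hypothetical. On a multichassis IRF fabric, the commands might also require a chassis number, which this sketch does not add.

```python
def collection_commands(active_slot, rebooted_slot):
    """Build the command lists from the example with the slots substituted."""
    any_view = [
        "display version",
        "display device",
        "display diagnostic-information",
        f"display kernel deadloop 20 verbose slot {active_slot}",
        f"display kernel exception 10 verbose slot {active_slot}",
        f"display kernel reboot 20 verbose slot {active_slot}",
    ]
    probe_view = [
        f"local logbuffer slot {rebooted_slot} display",
        f"local logbuffer slot {rebooted_slot} display from-highmemory",
        f"display reboot last-time slot {rebooted_slot}",
        "display system internal version",
        f"display diag-msg start-msg slot {rebooted_slot}",
    ]
    return {"any view": any_view, "probe view": probe_view}
```

Calling collection_commands(16, 17) reproduces the command lists in the example.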
Related alarm and log messages
Alarm messages
N/A
Log messages
· DEV/1/AUTO_SWITCH_FAULT_REBOOT
· DEV/5/BOARD_REBOOT
· DEV/1/BOARD_RUNNING_FAULT_REBOOT
· DEV/5/CHASSIS_REBOOT
· DEV/5/SUBCARD_REBOOT
· DEV/5/SYSTEM_REBOOT
· STM/4/STM_MERGE