01-Hardware Fault Cases
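· The card state is abnormal, for example Absent or Fault.
· The card reboots unexpectedly, fails to start, or reboots repeatedly.
Card states include Normal, Master, Standby, Absent, and Fault.
Master and Standby indicate the active and standby MPUs, respectively. Normal indicates that the card is operating correctly.
If a card is in Fault state, or a slot that holds a card reports Absent, the card might be faulty. See 1.1.2 Troubleshooting procedure.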
<Sysname> display device
Slot No. Brd Type Brd Status Subslot Sft Ver Patch Ver
0 NSQM1CGQ4TG24SHA0Normal 0 M9016-V-9153P22 None
1 NONE Absent 0 NONE None
2 NSQM1CGQ4TG24SHA0Normal 0 M9016-V-9153P22 None
3 NONE Absent 0 NONE None
4 NSQM1SUPD0 Master 0 M9016-V-9153P22 None
5 NSQM1SUPD0 Standby 0 M9016-V-9153P22 None
6 NSQM1FWEFGA0 Normal 0 M9016-V-9153P22 None
CPU 1 Normal 0 M9016-V-9153P22
7 NONE Absent 0 NONE None
8 NONE Absent 0 NONE None
9 NONE Absent 0 NONE None
10 NSQM1FAB08E0 Normal 0 M9016-V-9153P22 None
11 NSQM1FAB08E0 Normal 0 M9016-V-9153P22 None
12 NSQM1FAB08E0 Normal 0 M9016-V-9153P22 None
13 NSQM1FAB08E0 Normal 0 M9016-V-9153P22 None
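The state check above can be scripted once the display device output has been saved as text. The following is a minimal Python sketch, not an H3C tool. The names parse_display_device, suspect_slots, and populated_slots (the slots you know physically hold a card) are hypothetical and introduced only for illustration.

```python
import re

# States named in this guide; rows with any other state are ignored.
STATES = ("Normal", "Master", "Standby", "Absent", "Fault")

# Slot number, board type, then a state. The separator before the state is
# optional because a long board type can run into the status column.
ROW = re.compile(r"^\s*(\d+)\s+(\S+?)\s*(" + "|".join(STATES) + r")\b")


def parse_display_device(text):
    """Return a list of (slot, board_type, state) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((int(m.group(1)), m.group(2), m.group(3)))
    return rows


def suspect_slots(rows, populated_slots=()):
    """Slots in Fault state, plus known-populated slots that report Absent."""
    populated = set(populated_slots)
    return sorted(
        slot
        for slot, _board_type, state in rows
        if state == "Fault" or (state == "Absent" and slot in populated)
    )
```

The output alone cannot tell an empty slot from a card that is not detected, which is why the sketch asks the operator for the list of populated slots.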
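To check whether a card has rebooted, examine the logfile logs or use display version to view the uptime since startup. A card that has rebooted shows a noticeably shorter uptime than the other cards in the device. See 1.1.2 Troubleshooting procedure.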
<Sysname> display version
H3C Comware Software, Version 7.1.064, Release 9153P22
Copyright (c) 2004-2020 New H3C Technologies Co., Ltd. All rights reserved.
H3C SecPath M9016-V uptime is 0 weeks, 4 days, 0 hours, 7 minutes
Last reboot reason : User reboot
Boot image: flash:/M9000-CMW710-BOOT-R9153P22.bin
Boot image version: 7.1.064, Release 9153P22
Compiled Dec 10 2020 14:00:00
System image: flash:/M9000-CMW710-SYSTEM-R9153P22.bin
System image version: 7.1.064, Release 9153P22
Compiled Dec 10 2020 14:00:00
Feature image(s) list:
flash:/M9000-CMW710-DEVKIT-R9153P22.bin, version: 7.1.064
Compiled Dec 10 2020 14:00:00
LPU 0:
Uptime is 0 weeks,4 days,0 hours,3 minutes
H3C SecPath M9016-V LPU with 1 LS1043A Processor
BOARD TYPE: NSQM1CGQ4TG24SHA0
DRAM: 2048M bytes
PCB 1 Version: VER.A
Bootrom Version: 108
CPLD 1 Version: 002
CPLD 2 Version: 001
Release Version: H3C SecPath M9016-V-9153P22
Patch Version : None
Reboot Cause : UserReboot
LPU 2:
Uptime is 0 weeks,4 days,0 hours,3 minutes
H3C SecPath M9016-V LPU with 1 LS1043A Processor
BOARD TYPE: NSQM1CGQ4TG24SHA0
DRAM: 2048M bytes
PCB 1 Version: VER.A
Bootrom Version: 108
CPLD 1 Version: 002
CPLD 2 Version: 001
Release Version: H3C SecPath M9016-V-9153P22
Patch Version : None
Reboot Cause : UserReboot
MPU(M) 4:
Uptime is 0 weeks,4 days,0 hours,7 minutes
H3C SecPath M9016-V MPU(M) with 1 XLP316 Processor
BOARD TYPE: NSQM1SUPD0
DRAM: 8192M bytes
FLASH: 500M bytes
NVRAM: 512K bytes
PCB 1 Version: VER.A
Bootrom Version: 132
CPLD 1 Version: 004
CPLD 2 Version: 003
CPLD 3 Version: 003
Release Version: H3C SecPath M9016-V-9153P22
Patch Version : None
Reboot Cause : UserReboot
MPU(S) 5:
Uptime is 0 weeks,4 days,0 hours,6 minutes
H3C SecPath M9016-V MPU(S) with 1 XLP316 Processor
BOARD TYPE: NSQM1SUPD0
DRAM: 8192M bytes
FLASH: 500M bytes
NVRAM: 512K bytes
PCB 1 Version: VER.A
Bootrom Version: 132
CPLD 1 Version: 001
CPLD 2 Version: 001
CPLD 3 Version: 001
Release Version: H3C SecPath M9016-V-9153P22
Patch Version : None
Reboot Cause : UserReboot
LPU 6:
Uptime is 0 weeks,1 day,17 hours,56 minutes
H3C SecPath M9016-V LPU with 1 XLP308 Processor
BOARD TYPE: NSQM1FWEFGA0
DRAM: 2048M bytes
FLASH: 8M bytes
PCB 1 Version: VER.A
PCB 2 Version: VER.B
Bootrom Version: 100
CPLD 1 Version: 002
CPLD 2 Version: 002
Release Version: H3C SecPath M9016-V-9153P22
Patch Version : None
Reboot Cause : UserReboot
SLOT 6 CPU 1
CPU type: Multi-core CPU
DDR4 : 49152M bytes
FLASH: 7122M bytes
Board PCB Version: Ver.A
CPLD Version: 2.0
Release Version: SecBlade FW Enhanced-9153P22
FPGA 0 Version: B50506
FPGA 0 DATE: 2020.11.27
FPGA 1 Version: B50506
FPGA 1 DATE: 2020.11.27
Basic BootWare Version:1.03
Extend BootWare Version:1.03
NPU 10:
Uptime is 0 weeks,4 days,0 hours,3 minutes
H3C SecPath M9016-V NPU with 1 XLS208 Processor
BOARD TYPE: NSQM1FAB08E0
DRAM: 1024M bytes
PCB 1 Version: VER.B
Bootrom Version: 518
CPLD 1 Version: 005
Release Version: H3C SecPath M9016-V-9153P22
Patch Version : None
Reboot Cause : UserReboot
NPU 11:
Uptime is 0 weeks,3 days,23 hours,46 minutes
H3C SecPath M9016-V NPU with 1 XLS208 Processor
BOARD TYPE: NSQM1FAB08E0
DRAM: 1024M bytes
PCB 1 Version: VER.B
Bootrom Version: 518
CPLD 1 Version: 005
Release Version: H3C SecPath M9016-V-9153P22
Patch Version : None
Reboot Cause : ColdReboot
NPU 12:
Uptime is 0 weeks,3 days,23 hours,44 minutes
H3C SecPath M9016-V NPU with 1 XLS208 Processor
BOARD TYPE: NSQM1FAB08E0
DRAM: 1024M bytes
PCB 1 Version: VER.B
Bootrom Version: 511
CPLD 1 Version: 005
Release Version: H3C SecPath M9016-V-9153P22
Patch Version : None
Reboot Cause : ColdReboot
NPU 13:
Uptime is 0 weeks,3 days,23 hours,44 minutes
H3C SecPath M9016-V NPU with 1 XLS208 Processor
BOARD TYPE: NSQM1FAB08E0
DRAM: 1024M bytes
PCB 1 Version: VER.B
Bootrom Version: 518
CPLD 1 Version: 005
Release Version: H3C SecPath M9016-V-9153P22
Patch Version : None
Reboot Cause : ColdReboot
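The uptime comparison can also be scripted. The sketch below is a hedged illustration in Python that assumes the display version output has been saved as text and that per-card sections begin with headers such as "LPU 0:", "MPU(M) 4:", or "NPU 10:", as in the sample above. The function names and the 60-minute gap are assumptions chosen for the example, not values from H3C.

```python
import re

# Per-card section header, for example "LPU 0:", "MPU(M) 4:", "NPU 10:".
HEADER = re.compile(r"^\s*(LPU|MPU\([MS]\)|NPU)\s+(\d+):")
# Per-card uptime line. Units may be singular ("1 day") or plural.
UPTIME = re.compile(
    r"Uptime is (\d+) weeks?,\s*(\d+) days?,\s*(\d+) hours?,\s*(\d+) minutes?"
)


def board_uptimes(text):
    """Map slot number to uptime in minutes."""
    uptimes = {}
    slot = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            slot = int(header.group(2))
            continue
        uptime = UPTIME.search(line)
        if uptime and slot is not None:
            weeks, days, hours, minutes = map(int, uptime.groups())
            uptimes[slot] = ((weeks * 7 + days) * 24 + hours) * 60 + minutes
            slot = None
    return uptimes


def recently_rebooted(uptimes, gap_minutes=60):
    """Slots whose uptime trails the longest uptime by more than gap_minutes."""
    if not uptimes:
        return []
    longest = max(uptimes.values())
    return sorted(s for s, m in uptimes.items() if longest - m > gap_minutes)
```

Applied to the sample output above, only slot 6 (1 day, 17 hours, 56 minutes) trails the other cards by more than an hour. The few minutes of difference between the remaining cards reflect normal startup order.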
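(1) Verify that the card is seated firmly, for example by checking for a gap between the card and the chassis. You can also remove the card and reinsert it.
(2) Move the card to another slot, and install a known-good card from the same chassis in the original slot, to determine whether the card or the chassis slot is faulty.
(3) Check whether the LEDs on the card front panel and inside the card are lit.
(4) If the card is an MPU, a service module, or a switching fabric module, connect a console cable and check the configuration terminal for normal startup output or signs of an abnormal startup.
(5) If you confirm that the card is faulty, replace it and send the fault information to H3C technical support for analysis.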
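(1) Wait for a while to determine whether the card stays in Fault state or returns to Normal and then reboots again. If the card reaches Normal and then reboots by itself, send the fault information to H3C technical support for analysis.
(2) If the card is an MPU, an engine module, or a switching fabric module, connect a console cable and check the configuration terminal for normal startup output or abnormal startup messages. For example, if an MPU keeps rebooting because the memory read/write test fails during startup, check whether the memory modules on the MPU are seated firmly.
(3) Move the card to another slot to determine whether the card itself is faulty.
(4) If you confirm that the card is faulty, replace it and send the fault information to H3C technical support for analysis.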
Here, a card reboot means that the card has rebooted at some point but is currently in Normal state.
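(1) Use the logs or the uptime to determine when the reboot occurred, and check whether a user ran the reboot command or powered the card off and on around that time.
(2) If all cards rebooted at the same time, check whether the power modules are working correctly, whether the external power supply was interrupted, and whether the power input cables are firmly connected and have not come loose.
(3) Check whether the logs contain the message “Slot X need to be rebooted automatically!” or “Note:the operating device is sda0,it's not online” at the time of the reboot. Either message indicates an abnormal condition. Replace the card, collect the fault information, and send it to H3C technical support for analysis.
(4) If you cannot determine the cause, collect the fault information and send it to H3C technical support for analysis.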
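Step (3) amounts to a text search over the saved logs. A minimal Python sketch follows. The two search strings are taken from the messages quoted in step (3), while the function name abnormal_reboot_lines is hypothetical.

```python
# Fragments of the two messages quoted in step (3). The slot number in the
# first message varies, so only the fixed part is matched.
PATTERNS = (
    "need to be rebooted automatically!",
    "it's not online",
)


def abnormal_reboot_lines(log_text):
    """Return the log lines that contain either abnormal-reboot message."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern in line for pattern in PATTERNS)
    ]
```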
The fan tray LED is abnormal, and the device prints fan fault messages such as:
%Jun 26 10:12:24:805 2013 H3C DEV/3/FAN_ABSENT: -MDC=1; Chassis 2 Fan 2 is absent.
%Jun 26 10:12:32:805 2013 H3C DEVD/2/DRV_DEV_FAN_CHANGE: -MDC=1; Chassis 2: Fan communication state changed: Fan 1 changed to fault.
%Jun 26 10:12:42:405 2013 H3C DEV/2/FAN_FAILED: -MDC=1; Chassis 2 Fan 1 failed.
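(1) With the fan tray installed, hold your hand at the air outlet of the device to check for airflow. If there is no airflow, the fans are faulty.
(2) Check whether the fan air inlet or outlet is blocked or has accumulated too much dust.
(3) Check whether the fan tray is properly installed, whether each fan is in a normal state, and whether fan speeds differ by 50% or more. If anything is abnormal, remove and reinsert the fan tray, or swap it with another fan tray, to isolate the problem.
(4) If the fault persists, replace the fan tray. If no spare fan tray is available, power off the device to prevent the cards from being damaged by overheating. If cooling measures can keep the system below 50°C, you can continue to use the device temporarily.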
<Sysname> display fan
Chassis 1:
Fan Frame 0 State: Normal
Chassis 2:
Fan Frame 0 State: Normal
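The 50% speed check in step (3) can be expressed as simple arithmetic. The guide does not say how the difference is measured, so the Python sketch below assumes it is the gap between the fastest and slowest fan as a fraction of the fastest. The function names and the rpm inputs are hypothetical, since the display fan output shown here does not list speeds.

```python
def speed_spread(speeds_rpm):
    """Gap between fastest and slowest fan, as a fraction of the fastest."""
    fastest = max(speeds_rpm)
    if fastest <= 0:
        raise ValueError("at least one fan speed must be positive")
    return (fastest - min(speeds_rpm)) / fastest


def fans_need_attention(speeds_rpm, limit=0.5):
    """True when the spread reaches the limit (50% by default)."""
    return speed_spread(speeds_rpm) >= limit
```

For example, fans running at 6000 and 3000 rpm have a spread of 0.5 and would be flagged, while 6000 and 5400 rpm give 0.1 and would not.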
The device prints alarms for temperatures that are too low or too high, such as:
%Jun 26 10:13:46:233 2013 H3C DEV/4/TEMPERATURE_WARNING: -MDC=1; Temperature is greater than warning upper limit on Chassis 1 slot 2 sensor inflow 1.
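(1) Check whether the ambient temperature is normal. If it is high, identify the cause, such as poor ventilation in the equipment room or an air-conditioning failure.
(2) Check whether the current temperature of the device is below the lower limit or above the Warning or Alarm threshold. You can also touch the card to check whether it is very hot. If it is, identify the cause immediately. Sustained high temperature can damage the card.
(3) If the temperature is too high, see 1.2 Fan faults to determine whether a fan fault is the cause.
(4) If the temperature value is error or clearly unrealistic, access to the card temperature sensor over the I2C bus might be failing (display environment command). Transceiver information is also read over the I2C bus (display transceiver command), so check whether the card can read transceiver information correctly. If transceiver access is normal, use the temperature-limit command to reset the temperature alarm thresholds for the card, and use display environment to verify that the setting took effect.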
[Sysname] temperature-limit chassis 2 slot 1 hotspot 1 0 85 90
<Sysname> display environment
System temperature information (degree centigrade):
----------------------------------------------------------------------
Chassis Slot Sensor Temperature Lower Warning Alarm Shutdown
1 0 inflow 1 35 0 48 60 NA
1 0 hotspot 1 43 0 80 95 NA
1 1 inflow 1 34 0 48 60 NA
1 1 hotspot 1 38 0 80 95 NA
1 2 hotspot 1 49 0 88 100 110
1 2 hotspot 1 49 0 88 100 110
1 3 hotspot 1 43 0 80 97 NA
1 3 hotspot 2 41 0 80 97 NA
1 4 hotspot 1 42 0 80 97 NA
1 4 hotspot 2 40 0 80 97 NA
1 5 hotspot 1 45 0 80 97 NA
1 5 hotspot 2 41 0 80 97 NA
1 6 hotspot 1 53 0 88 100 110
1 6 hotspot 1 53 0 88 100 110
1 7 hotspot 1 55 0 88 100 110
1 7 hotspot 1 55 0 88 100 110
1 8 hotspot 1 67 0 88 100 110
1 8 hotspot 1 67 0 88 100 110
1 9 hotspot 1 61 0 88 100 110
1 9 hotspot 1 61 0 88 100 110
2 0 inflow 1 34 0 85 90 NA
2 0 hotspot 1 42 0 85 90 NA
2 1 inflow 1 36 0 85 90 NA
2 1 hotspot 1 41 0 85 90 NA
2 2 hotspot 1 56 0 88 100 110
2 2 hotspot 1 56 0 88 100 110
2 3 hotspot 1 47 0 80 97 NA
2 3 hotspot 2 44 0 80 97 NA
(5) If you still cannot determine the cause, collect the information and send it to H3C technical support for assistance.
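The threshold comparison in step (2) and the unreadable-value check in step (4) can be sketched in Python as follows. The sketch assumes the display environment output has been saved as text and that each data row has the nine fields shown in the sample (chassis, slot, sensor name, sensor index, temperature, lower, warning, alarm, shutdown). The function names and the strict greater-than and less-than comparisons are assumptions made for illustration.

```python
FIELDS = ("temp", "lower", "warning", "alarm", "shutdown")


def _num(field):
    """Return the field as an int, or None for NA, error, or other text."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_environment(text):
    """Return one dict per sensor row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        if len(f) != 9 or not (f[0].isdigit() and f[1].isdigit()):
            continue
        row = {"chassis": int(f[0]), "slot": int(f[1]), "sensor": f[2] + " " + f[3]}
        row.update(zip(FIELDS, map(_num, f[4:])))
        rows.append(row)
    return rows


def classify(row):
    """Return unreadable, shutdown, alarm, warning, low, or ok for one row."""
    temp = row["temp"]
    if temp is None:
        return "unreadable"
    # Check the most severe upper limit first.
    for name in ("shutdown", "alarm", "warning"):
        limit = row[name]
        if limit is not None and temp > limit:
            return name
    if row["lower"] is not None and temp < row["lower"]:
        return "low"
    return "ok"
```

A row classified as unreadable corresponds to the I2C access problem described in step (4) and is a cue to check transceiver reads next.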
The card reboots unexpectedly or is powered off.
View the diagfile.log log.
<Sysname> more diagfile/diagfile.log
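View the diagfile.log log on the device. HG port fault logs between cards indicate that an internal port between the cards failed and that the system decided to reboot the card. If the internal port recovers after the card reboots, send the logs to technical support for analysis.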
%@12527^Dec 19 16:10:56:906 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=10; Chassis 1 Slot 10 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 1: The source port went down.
%@12528^Dec 19 16:10:56:640 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=13; Chassis 1 Slot 13 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 8: The source port went down.
%@12529^Dec 19 16:10:57:376 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=11; Chassis 1 Slot 11 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 3: The source port went down.
%@12530^Dec 19 16:10:56:740 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=12; Chassis 1 Slot 12 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 6: The source port went down.
%@12554^Dec 19 16:11:11:959 2020 M9k DRV/3/FAULT_MONITOR_BITMAP:
Fault PhySlot List: 3
Fault Reason BitMap:
slot : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
-----------------------------------------------------
Fabric1 : 5 5 5 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fabric2 : 5 5 5 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fabric3 : 5 5 5 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fabric4 : 5 5 5 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5
-----------------------------------------------------
IO board: 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fault Reason: 0-RFCS, 1-RERPKT, 2-DOWN, 3-UNRESP, 4-1bit, 5-NORMAL
%@12555^Dec 19 16:11:11:960 2020 M9k DRV/3/FAULT_MONITOR_REBOOT: Chassis 1 Slot 3: The card will be restarted due to a hardware failure.
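The FAULT_MONITOR_BITMAP block is easier to read once the non-normal entries are pulled out. The Python sketch below is an illustration only. It uses the reason codes printed in the "Fault Reason" legend of the log, and it assumes the block has been saved as text with the "slot" row preceding the Fabric and IO board rows. The function name decode_bitmap is hypothetical.

```python
import re

# Reason codes as printed in the "Fault Reason" legend of the log.
REASONS = {0: "RFCS", 1: "RERPKT", 2: "DOWN", 3: "UNRESP", 4: "1bit", 5: "NORMAL"}

SLOTS = re.compile(r"^\s*slot\s*:\s*([\d\s]+)$")
ROW = re.compile(r"^\s*(Fabric\d+|IO board)\s*:\s*([\d\s]+)$")


def decode_bitmap(text):
    """Return (row_label, slot_column, reason) for every entry that is not NORMAL."""
    slots = None
    faults = []
    for line in text.splitlines():
        header = SLOTS.match(line)
        if header:
            slots = [int(x) for x in header.group(1).split()]
            continue
        row = ROW.match(line)
        if row and slots is not None:
            codes = [int(x) for x in row.group(2).split()]
            for slot, code in zip(slots, codes):
                if code != 5:
                    faults.append((row.group(1), slot, REASONS.get(code, "unknown")))
    return faults
```

For the bitmap above, every Fabric row reports DOWN in the slot 3 column, which matches the "Fault PhySlot List: 3" line and the subsequent restart of slot 3.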
View the logfile.log log.
<Sysname> more logfile/logfile.log
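View the logfile.log log on the device. HG port fault logs between cards indicate that an internal port between the cards failed and that the system decided to reboot the card. If the internal port recovers after the card reboots, send the logs to technical support for analysis.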
%@4387931%Dec 19 16:10:56:906 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=10; Chassis 1 Slot 10 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 1: The connectivity of the internal port failed.
%@4387932%Dec 19 16:10:56:640 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=13; Chassis 1 Slot 13 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 8: The connectivity of the internal port failed.
%@4387933%Dec 19 16:10:57:376 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=11; Chassis 1 Slot 11 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 3: The connectivity of the internal port failed.
%@4387934%Dec 19 16:10:56:740 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=12; Chassis 1 Slot 12 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 6: The connectivity of the internal port failed.
%@4387947%Dec 19 16:11:11:960 2020 M9k DRV/3/FAULT_MONITOR_REBOOT: Chassis 1 Slot 3: The card will be restarted due to a hardware failure.
%@4387948%Dec 19 16:11:12:151 2020 M9k DEV/2/BOARD_STATE_FAULT: Board state changed to Fault on chassis 1 slot 3, type is NSQM1FWEFGA0.
View the diagfile.log log.
<Sysname> more diagfile/diagfile.log
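View the diagfile.log log on the device. If HG port faults between cards cause three reboots within half an hour and the log “The card will be isolated due to a hardware failure” then appears, the internal port between the cards has failed and cannot be recovered by rebooting, so the system has decided to power off and isolate the card. In this case, use the logs to identify the cards involved in the HG port fault, replace them, and send the logs to technical support for analysis.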
%@12574^Dec 19 17:15:53:091 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=10; Chassis 1 Slot 10 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 1: The source port went down.
%@12584^Dec 19 17:23:57:002 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=10; Chassis 1 Slot 10 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 1: The source port went down.
%@12605^Dec 19 17:32:34:001 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=10; Chassis 1 Slot 10 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 1: The source port went down.
%@12615^Dec 19 17:32:54:996 2020 M9k DRV/3/FAULT_MONITOR_BITMAP:
Fault PhySlot List: 10
Fault Reason BitMap:
slot : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
-----------------------------------------------------
Fabric1 : 5 5 5 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fabric2 : 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fabric3 : 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fabric4 : 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
-----------------------------------------------------
IO board: 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fault Reason: 0-RFCS, 1-RERPKT, 2-DOWN, 3-UNRESP, 4-1bit, 5-NORMAL
%@12616^Dec 19 17:32:54:996 2020 M9k DRV/3/FAULT_MONITOR_ISOLATE: Chassis 1 Slot 10: The card will be isolated due to a hardware failure.
View the logfile.log log.
<Sysname> more logfile/logfile.log
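View the logfile.log log on the device. If HG port faults between cards cause three reboots within half an hour and the log “The card will be isolated due to a hardware failure” then appears, the internal port between the cards has failed and cannot be recovered by rebooting, so the system has decided to power off and isolate the card. In this case, use the logs to identify the cards involved in the HG port fault, replace them, and send the logs to technical support for analysis.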
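To spot a port that is heading toward isolation, you can count HG_MONITOR_PORT_ERROR events per source port inside a 30-minute window. The Python sketch below is an approximation under stated assumptions: it counts log events rather than actual reboots, so a single incident that logs several errors can inflate the count, and the device's own arbitration remains authoritative. The function name and the parsing regex are hypothetical and based only on the log format shown above.

```python
import re
from datetime import datetime, timedelta

# Timestamp, host name, then the source "Slot S Unit U Port P to ..." triple.
EVENT = re.compile(
    r"(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}:\d{3} \d{4}) \S+ "
    r"DRV/3/HG_MONITOR_PORT_ERROR: .*?Slot (\d+) Unit (\d+) Port (\d+) to "
)


def ports_with_repeated_faults(log_text, count=3, window=timedelta(minutes=30)):
    """Return source (slot, unit, port) tuples with `count` events inside `window`."""
    by_port = {}
    for line in log_text.splitlines():
        m = EVENT.search(line)
        if m:
            stamp = datetime.strptime(m.group(1), "%b %d %H:%M:%S:%f %Y")
            key = tuple(int(x) for x in m.group(2, 3, 4))
            by_port.setdefault(key, []).append(stamp)
    hits = []
    for key, stamps in by_port.items():
        stamps.sort()
        # Slide over each run of `count` consecutive events.
        for i in range(len(stamps) - count + 1):
            if stamps[i + count - 1] - stamps[i] <= window:
                hits.append(key)
                break
    return sorted(hits)
```

In the diagfile.log sample above, slot 10 unit 0 port 3 logs three errors between 17:15 and 17:32, so it would be reported.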
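On M9000 series high-end firewalls, you can check the software versions of a healthy device by using display device or display version in any view, or display system internal version in Probe view.
The MPU and service module software must come from the same version branch. Otherwise, the MPU and service module versions are inconsistent and the service modules enter Fault state. For example, if the MPU runs a 45SP branch version and a service module runs a 32SP branch version, the service module stays in Fault state and cannot start.
If a card is in Fault state and cannot start because the MPU and the service module use different branches, one of two situations occurs: the card stays in Fault state and keeps rebooting, or the card hangs in Fault state. In either case, upgrading through boot-loader or the Web interface no longer applies, and the only way to resolve the Fault state is to upgrade the software through BootWare (Ctrl+B). For a card that hangs in Fault state, remove and reinsert the card to restart it, press Ctrl+B repeatedly until the card enters the BootWare upgrade menu, and then upgrade the software through BootWare.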
<Sysname> display boot-loader
Software images on chassis 1 slot 0:
Current software images:
flash:/M9000-CMW710-BOOT-R9153P22.bin
flash:/M9000-CMW710-SYSTEM-R9153P22.bin
flash:/M9000-CMW710-DEVKIT-R9153P22.bin
Main startup software images:
flash:/M9000-CMW710-BOOT-R9153P22.bin
flash:/M9000-CMW710-SYSTEM-R9153P22.bin
flash:/M9000-CMW710-DEVKIT-R9153P22.bin
Backup startup software images:
None
Software images on chassis 1 slot 1:
Current software images:
flash:/M9000-CMW710-BOOT-R9153P22.bin
flash:/M9000-CMW710-SYSTEM-R9153P22.bin
flash:/M9000-CMW710-DEVKIT-R9153P22.bin
Main startup software images:
flash:/M9000-CMW710-BOOT-R9153P22.bin
flash:/M9000-CMW710-SYSTEM-R9153P22.bin
flash:/M9000-CMW710-DEVKIT-R9153P22.bin
Backup startup software images:
None
Software images on chassis 1 slot 2.1:
Current software images:
sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin
Main startup software images:
sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin
Backup startup software images:
None
Software images on chassis 1 slot 3.1:
Current software images:
sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin
Main startup software images:
sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin
Backup startup software images:
None
Software images on chassis 2 slot 0:
Current software images:
flash:/M9000-CMW710-BOOT-R9153P22.bin
flash:/M9000-CMW710-SYSTEM-R9153P22.bin
flash:/M9000-CMW710-DEVKIT-R9153P22.bin
Main startup software images:
flash:/M9000-CMW710-BOOT-R9153P22.bin
flash:/M9000-CMW710-SYSTEM-R9153P22.bin
flash:/M9000-CMW710-DEVKIT-R9153P22.bin
Backup startup software images:
None
Software images on chassis 2 slot 1:
Current software images:
flash:/M9000-CMW710-BOOT-R9153P22.bin
flash:/M9000-CMW710-SYSTEM-R9153P22.bin
flash:/M9000-CMW710-DEVKIT-R9153P22.bin
Main startup software images:
flash:/M9000-CMW710-BOOT-R9153P22.bin
flash:/M9000-CMW710-SYSTEM-R9153P22.bin
flash:/M9000-CMW710-DEVKIT-R9153P22.bin
Backup startup software images:
None
Software images on chassis 2 slot 2.1:
Current software images:
sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin
Main startup software images:
sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin
Backup startup software images:
None
Software images on chassis 2 slot 3.1:
Current software images:
sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin
Main startup software images:
sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin
sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin
Backup startup software images:
None
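A quick consistency check over the display boot-loader output can reveal a mismatch before it leaves a card in Fault state. The Python sketch below assumes the output has been saved as text and that the release tag is the part of the image file name between the last hyphen and ".bin" (for example R9153P22). Treating an identical release tag as "same version branch" is an assumption, and the function names are hypothetical.

```python
import re

SECTION = re.compile(r"^\s*Software images on (chassis \d+ slot [\d.]+):")
RELEASE = re.compile(r"-(R\w+)\.bin\s*$")


def current_releases(text):
    """Map each 'chassis X slot Y' to the set of release tags it is running."""
    releases = {}
    where = None
    in_current = False
    for line in text.splitlines():
        section = SECTION.match(line)
        if section:
            where = section.group(1)
            releases[where] = set()
            in_current = False
            continue
        stripped = line.strip()
        if stripped.endswith(":"):
            # Only the "Current software images" list is inspected.
            in_current = stripped == "Current software images:"
            continue
        tag = RELEASE.search(line)
        if tag and where is not None and in_current:
            releases[where].add(tag.group(1))
    return releases


def has_mismatch(releases):
    """True when more than one release tag appears across all cards."""
    return len(set().union(*releases.values())) > 1
```

For the output above, every card reports R9153P22, so has_mismatch returns False.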
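GOLD (Generic OnLine Diagnostics) runs diagnostic tests on the device to detect hardware faults and report them. GOLD mainly checks whether ports, memory, chips, connections, forwarding paths, and control paths are working correctly.
HG Monitor monitors the running state of the internal ports (HG/XE) centered on the switching fabric modules. These include the HG ports between switching fabric modules and interface modules, the XE ports between switching fabric modules and front-inserted service modules, and the HG ports between switching fabric modules and MPUs (on M9000 devices).
HG Monitor detects port-down errors on HG links and attempts recovery by rebooting the card first and isolating it afterward. The basic process is as follows:
· HG Monitor is responsible for fault monitoring and reporting. When the HG Monitor on a switching fabric module detects a fault within a monitoring period (10 seconds), it reports the fault to GOLD on the MPU.
· GOLD Fault Monitor is responsible for fault arbitration and isolation of the faulty card. The MPU checks every 10 seconds for reported faults. If a fault is reported, the MPU identifies the faulty card based on which cards are present and isolates it. The isolation actions are three card reboots followed by a card power-off: if the fault persists after three reboots, the card is powered off and isolated.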
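The reboot-then-isolate policy can be pictured as a small state machine. The Python sketch below is a toy model written for illustration, not the GOLD implementation: it counts reboots per slot and isolates on the next fault after three reboots, and it ignores the 10-second polling period and any time window. The class and method names are hypothetical.

```python
class FaultArbiter:
    """Toy model: reboot a faulty card up to max_reboots times, then isolate it."""

    def __init__(self, max_reboots=3):
        self.max_reboots = max_reboots
        self.reboots = {}      # slot -> number of reboots already issued
        self.isolated = set()  # slots that have been powered off

    def report_fault(self, slot):
        """Return the action taken for a fault reported against this slot."""
        if slot in self.isolated:
            return "isolated"
        issued = self.reboots.get(slot, 0)
        if issued < self.max_reboots:
            self.reboots[slot] = issued + 1
            return "reboot"
        self.isolated.add(slot)
        return "isolate"
```

Reporting four consecutive faults against the same slot yields three "reboot" actions followed by "isolate". Faults against other slots are tracked independently.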
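Logs related to hg-monitor and fault-monitor arbitration are recorded in the diag log, and syslog also records port-down events. Typical log examples:
hg-monitor detects that an HG port is down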
syslog:
DRV/3/HG_MONITOR_BETWEEN_DROP: -Slot=10; Slot 1 Unit 0 Port 33 to Slot 10 Unit 0 Port 16: Packets were dropped in the HG channel.
%Dec 12 15:46:57:103 2020 M9016_1 DRV/3/PORT_MONITOR_DOWN: -Slot=1; Slot 1 Unit 0 Port 33: The internal port went down.
%Dec 12 15:46:56:897 2020 M9016_1 DRV/3/PORT_MONITOR_DOWN: -Slot=10; Slot 10 Unit 0 Port 16: The internal port went down.
%Dec 12 15:46:57:828 2020 M9016_1 DRV/3/FAULT_MONITOR_REBOOT: Slot 10: The card will be restarted due to a hardware failure.
%Dec 12 15:46:58:219 2020 M9016_1 DEV/2/BOARD_STATE_FAULT: Board state changed to Fault on slot 10, type is NSQ1FAB08D0.
Diag log:
%@1211933^Dec 12 15:46:37:402 2020 M9016_1 DRV/3/HG_MONITOR_PORT_ERROR: -Slot=10; Slot 10 Unit 0 Port 16 to Slot 1 Unit 0 Port 33: The source port went down.
%@1211934^Dec 12 15:46:48:046 2020 M9016_1 DRV/3/HG_MONITOR_BETWEEN_DROP: -Slot=10; Slot 1 Unit 0 Port 33 to Slot 10 Unit 0 Port 16: Packets were dropped in the HG channel.
%@1211935^Dec 12 15:46:49:213 2020 M9016_1 DRV/3/HG_MONITOR_PORT_ERROR: -Slot=10; Slot 10 Unit 0 Port 16 to Slot 1 Unit 0 Port 33: The source port went down.
%@1211936^Dec 12 15:46:57:103 2020 M9016_1 DRV/3/PORT_MONITOR_DOWN: -Slot=1; Slot 1 Unit 0 Port 33: The internal port went down.
%@1211937^Dec 12 15:46:56:897 2020 M9016_1 DRV/3/PORT_MONITOR_DOWN: -Slot=10; Slot 10 Unit 0 Port 16: The internal port went down.
%@1211938^Dec 12 15:46:57:827 2020 M9016_1 DRV/3/FAULT_MONITOR_BITMAP:
Fault PhySlot List: 10
Fault Reason BitMap:
slot : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
-----------------------------------------------------
Fabric1 : 5 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fabric2 : 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fabric3 : 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fabric4 : 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
-----------------------------------------------------
IO board: 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fault Reason: 0-RFCS, 1-RERPKT, 2-DOWN, 3-UNRESP, 4-1bit, 5-NORMAL
%@1211939^Dec 12 15:46:57:828 2020 M9016_1 DRV/3/FAULT_MONITOR_REBOOT: Slot 10: The card will be restarted due to a hardware failure
Syslog:
%Dec 12 16:13:34:637 2020 M9016_1 DRV/3/HG_MONITOR_PORT_ERROR: -Slot=10; Slot 10 Unit 0 Port 16 to Slot 1 Unit 0 Port 33: The connectivity of the internal port failed.
%Dec 12 16:13:45:189 2020 M9016_1 DRV/3/HG_MONITOR_BETWEEN_DROP: -Slot=10; Slot 1 Unit 0 Port 33 to Slot 10 Unit 0 Port 16: Packets were dropped in the HG channel.
[M9016_1-probe]%Dec 12 16:13:54:437 2020 M9016_1 DRV/3/PORT_MONITOR_DOWN: -Slot=1; Slot 1 Unit 0 Port 33: The internal port went down.
%Dec 12 16:13:53:927 2020 M9016_1 DRV/3/PORT_MONITOR_DOWN: -Slot=10; Slot 10 Unit 0 Port 16: The internal port went down.
%Dec 12 16:13:53:959 2020 M9016_1 DRV/3/FAULT_MONITOR_ISOLATE: Slot 10: The card will be isolated due to a hardware failure.
%Dec 12 16:13:53:999 2020 M9016_1 DEV/2/BOARD_STATE_FAULT: Board state changed to Fault on slot 10, type is NSQ1FAB08D0.
Diag log:
%@1211957^Dec 12 16:13:53:958 2020 M9016_1 DRV/3/FAULT_MONITOR_BITMAP:
Fault PhySlot List: 10
Fault Reason BitMap:
slot : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
-----------------------------------------------------
Fabric1 : 5 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fabric2 : 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fabric3 : 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fabric4 : 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
-----------------------------------------------------
IO board: 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Fault Reason: 0-RFCS, 1-RERPKT, 2-DOWN, 3-UNRESP, 4-1bit, 5-NORMAL
%@1211958^Dec 12 16:13:53:959 2020 M9016_1 DRV/3/FAULT_MONITOR_ISOLATE: Slot 10: The card will be isolated due to a hardware failure.
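After three arbitrated reboots of the switching fabric module, the fault persists, so the module is powered off and isolated.
Replace the faulty component with a spare.
· Command: display hgmonitor info X Y
· Meaning: Displays hgmonitor information for a card. X is the slot number and Y is the chip number.
· Example: View hgmonitor information for chip 0 on the switching fabric module in slot 11. The output includes fault records for the HG ports to interface modules, the XE ports to service modules, and the HG ports to MPUs on chip 0.
· Command: display hgmonitor judge X
· Meaning: Displays hgmonitor fault statistics for a card. X is the slot number.
· Example: View hgmonitor fault statistics for the switching fabric module in slot 13. The output shows a fault on slot 13 unit 0 port 1.
· Command: debug gold show monitor fault slot X
· Meaning: Displays the number of cards currently isolated on the MPU. X is the MPU slot number.
· Example: View the number of isolated cards on the active MPU in slot 5. The maximum isolation count is 1 and the current isolation count is 0.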
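Table 1-1 Fault diagnosis commands
| Command | Description |
| --- | --- |
| display device | Displays device information. Use it to check whether each card is in a normal state. |
| display environment | Displays device temperature information. Use it to check whether the temperature is normal (whether it exceeds the temperature alarm thresholds). |
| display fan | Displays the operating state of the built-in fans. |
| display power | Displays power system information, including the power management enable state; the power type, rated input voltage, and rated output power; and the state of the installed power modules. |
| display version | Displays the system version information, the uptime of each card, and the reason for the last reboot. |
| save | Saves the current configuration to the specified file. |
| temperature-limit | Sets the temperature alarm thresholds for the device. |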
