
03 - Hardware

01 - Hardware fault cases


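1.1  Card faults

1.1.1  Fault description

·     The card state is abnormal (for example, Absent or Fault).

·     The card reboots unexpectedly, fails to start, or reboots repeatedly.

1. Checking the card state

The card states are Normal, Master, Standby, Absent, and Fault.

Master and Standby indicate the active and standby MPUs respectively. Normal indicates that the card is operating correctly.

If a card is in Fault state, or a slot holds a card but reports Absent, the card might be faulty. See 1.1.2  Troubleshooting procedure.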

<Sysname> display device

Slot No. Brd Type         Brd Status   Subslot Sft Ver                Patch Ver

 0       NSQM1CGQ4TG24SHA0Normal       0       M9016-V-9153P22        None

 1       NONE             Absent       0       NONE                   None

 2       NSQM1CGQ4TG24SHA0Normal       0       M9016-V-9153P22        None

 3       NONE             Absent       0       NONE                   None

 4       NSQM1SUPD0       Master       0       M9016-V-9153P22        None

 5       NSQM1SUPD0       Standby      0       M9016-V-9153P22        None

 6       NSQM1FWEFGA0     Normal       0       M9016-V-9153P22        None

         CPU 1            Normal       0       M9016-V-9153P22

 7       NONE             Absent       0       NONE                   None

 8       NONE             Absent       0       NONE                   None

 9       NONE             Absent       0       NONE                   None

 10      NSQM1FAB08E0     Normal       0       M9016-V-9153P22        None

 11      NSQM1FAB08E0     Normal       0       M9016-V-9153P22        None

 12      NSQM1FAB08E0     Normal       0       M9016-V-9153P22        None

 13      NSQM1FAB08E0     Normal       0       M9016-V-9153P22        None
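The state check above can be automated when the output has been saved to a text file. The following is a minimal Python sketch, not an H3C tool; the function names are hypothetical. It assumes the column layout shown above, where a long board type can run straight into the state column (as in "NSQM1CGQ4TG24SHA0Normal").

```python
import re

# Card states listed in this section.
STATES = ("Normal", "Master", "Standby", "Absent", "Fault")

# Slot number, board type, then state. The board type is matched lazily and
# the whitespace before the state is optional, because a long board type can
# be fused with the state column.
ROW = re.compile(r"^\s*(\d+)\s+(\S+?)\s*(" + "|".join(STATES) + r")\s")


def parse_display_device(text):
    """Return a list of (slot, board_type, state) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((int(m.group(1)), m.group(2), m.group(3)))
    return rows


def cards_in_fault(rows):
    """Return the rows whose state is Fault."""
    return [r for r in rows if r[2] == "Fault"]
```

An Absent row cannot be judged from the output alone: it is a problem only if a card is physically installed in that slot, so that check remains manual.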

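2. Confirming a card reboot

Check the logfile log, or use display version to view the uptime since startup, to determine whether a card has rebooted. A card that has rebooted shows a noticeably shorter uptime than the other cards in the device. See 1.1.2  Troubleshooting procedure.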

<Sysname> display version

H3C Comware Software, Version 7.1.064, Release 9153P22

Copyright (c) 2004-2020 New H3C Technologies Co., Ltd. All rights reserved.

H3C SecPath M9016-V uptime is 0 weeks, 4 days, 0 hours, 7 minutes

Last reboot reason : User reboot

 

Boot image: flash:/M9000-CMW710-BOOT-R9153P22.bin

Boot image version: 7.1.064, Release 9153P22

  Compiled Dec 10 2020 14:00:00

System image: flash:/M9000-CMW710-SYSTEM-R9153P22.bin

System image version: 7.1.064, Release 9153P22

  Compiled Dec 10 2020 14:00:00

Feature image(s) list:

  flash:/M9000-CMW710-DEVKIT-R9153P22.bin, version: 7.1.064

    Compiled Dec 10 2020 14:00:00

 

 

LPU 0:

Uptime is 0 weeks,4 days,0 hours,3 minutes

H3C SecPath M9016-V LPU with 1 LS1043A Processor

BOARD TYPE:         NSQM1CGQ4TG24SHA0

DRAM:               2048M bytes

PCB 1 Version:      VER.A

Bootrom Version:    108

CPLD 1 Version:     002

CPLD 2 Version:     001

Release Version:    H3C SecPath M9016-V-9153P22

Patch Version  :    None

Reboot Cause  :     UserReboot

 

LPU 2:

Uptime is 0 weeks,4 days,0 hours,3 minutes

H3C SecPath M9016-V LPU with 1 LS1043A Processor

BOARD TYPE:         NSQM1CGQ4TG24SHA0

DRAM:               2048M bytes

PCB 1 Version:      VER.A

Bootrom Version:    108

CPLD 1 Version:     002

CPLD 2 Version:     001

Release Version:    H3C SecPath M9016-V-9153P22

Patch Version  :    None

Reboot Cause  :     UserReboot

 

MPU(M) 4:

Uptime is 0 weeks,4 days,0 hours,7 minutes

H3C SecPath M9016-V MPU(M) with 1 XLP316 Processor

BOARD TYPE:         NSQM1SUPD0

DRAM:               8192M bytes

FLASH:              500M bytes

NVRAM:              512K bytes

PCB 1 Version:      VER.A

Bootrom Version:    132

CPLD 1 Version:     004

CPLD 2 Version:     003

CPLD 3 Version:     003

Release Version:    H3C SecPath M9016-V-9153P22

Patch Version  :    None

Reboot Cause  :     UserReboot

 

MPU(S) 5:

Uptime is 0 weeks,4 days,0 hours,6 minutes

H3C SecPath M9016-V MPU(S) with 1 XLP316 Processor

BOARD TYPE:         NSQM1SUPD0

DRAM:               8192M bytes

FLASH:              500M bytes

NVRAM:              512K bytes

PCB 1 Version:      VER.A

Bootrom Version:    132

CPLD 1 Version:     001

CPLD 2 Version:     001

CPLD 3 Version:     001

Release Version:    H3C SecPath M9016-V-9153P22

Patch Version  :    None

Reboot Cause  :     UserReboot

 

LPU 6:

Uptime is 0 weeks,1 day,17 hours,56 minutes

H3C SecPath M9016-V LPU with 1 XLP308 Processor

BOARD TYPE:         NSQM1FWEFGA0

DRAM:               2048M bytes

FLASH:              8M bytes

PCB 1 Version:      VER.A

PCB 2 Version:      VER.B

Bootrom Version:    100

CPLD 1 Version:     002

CPLD 2 Version:     002

Release Version:    H3C SecPath M9016-V-9153P22

Patch Version  :    None

Reboot Cause  :     UserReboot

SLOT 6 CPU 1

CPU type:           Multi-core CPU

DDR4 :              49152M bytes

FLASH:              7122M bytes

Board PCB Version:  Ver.A

CPLD Version:       2.0

Release Version:    SecBlade FW Enhanced-9153P22

FPGA 0 Version:     B50506

FPGA 0 DATE:        2020.11.27

FPGA 1 Version:     B50506

FPGA 1 DATE:        2020.11.27

Basic  BootWare Version:1.03

Extend BootWare Version:1.03

 

NPU 10:

Uptime is 0 weeks,4 days,0 hours,3 minutes

H3C SecPath M9016-V NPU with 1 XLS208 Processor

BOARD TYPE:         NSQM1FAB08E0

DRAM:               1024M bytes

PCB 1 Version:      VER.B

Bootrom Version:    518

CPLD 1 Version:     005

Release Version:    H3C SecPath M9016-V-9153P22

Patch Version  :    None

Reboot Cause  :     UserReboot

 

NPU 11:

Uptime is 0 weeks,3 days,23 hours,46 minutes

H3C SecPath M9016-V NPU with 1 XLS208 Processor

BOARD TYPE:         NSQM1FAB08E0

DRAM:               1024M bytes

PCB 1 Version:      VER.B

Bootrom Version:    518

CPLD 1 Version:     005

Release Version:    H3C SecPath M9016-V-9153P22

Patch Version  :    None

Reboot Cause  :     ColdReboot

 

NPU 12:

Uptime is 0 weeks,3 days,23 hours,44 minutes

H3C SecPath M9016-V NPU with 1 XLS208 Processor

BOARD TYPE:         NSQM1FAB08E0

DRAM:               1024M bytes

PCB 1 Version:      VER.B

Bootrom Version:    511

CPLD 1 Version:     005

Release Version:    H3C SecPath M9016-V-9153P22

Patch Version  :    None

Reboot Cause  :     ColdReboot

 

NPU 13:

Uptime is 0 weeks,3 days,23 hours,44 minutes

H3C SecPath M9016-V NPU with 1 XLS208 Processor

BOARD TYPE:         NSQM1FAB08E0

DRAM:               1024M bytes

PCB 1 Version:      VER.B

Bootrom Version:    518

CPLD 1 Version:     005

Release Version:    H3C SecPath M9016-V-9153P22

Patch Version  :    None

Reboot Cause  :     ColdReboot
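Comparing uptimes by eye across many boards is error-prone. The sketch below, in Python, extracts the per-board uptime from saved display version output and lists boards whose uptime trails the longest one. The function names and the 60-minute margin are assumptions chosen for illustration; boards normally finish booting a few minutes apart, so a small gap is not a reboot.

```python
import re

# Board headers such as "LPU 0:", "MPU(M) 4:", "NPU 10:".
HEADER = re.compile(r"^\s*((?:LPU|NPU|MPU\([MS]\)) \d+):")
UPTIME = re.compile(
    r"Uptime is (\d+) weeks?,\s*(\d+) days?,\s*(\d+) hours?,\s*(\d+) minutes?"
)


def board_uptimes(text):
    """Map each board header (for example 'LPU 6') to its uptime in minutes."""
    uptimes = {}
    board = None
    for line in text.splitlines():
        h = HEADER.match(line)
        if h:
            board = h.group(1)
            continue
        u = UPTIME.search(line)
        if u and board is not None:
            weeks, days, hours, minutes = (int(x) for x in u.groups())
            uptimes[board] = ((weeks * 7 + days) * 24 + hours) * 60 + minutes
            board = None
    return uptimes


def recently_rebooted(uptimes, margin_minutes=60):
    """Boards whose uptime trails the longest uptime by more than the margin."""
    if not uptimes:
        return []
    longest = max(uptimes.values())
    return sorted(b for b, m in uptimes.items() if longest - m > margin_minutes)
```

Applied to the output above, this would single out LPU 6, whose uptime is 1 day 17 hours while the other boards report about 4 days.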

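1.1.2  Troubleshooting procedure

1. Card state is Absent

(1)     Check that the card is firmly seated, for example by looking for a gap between the card and the chassis. You can also remove the card and reinsert it.

(2)     Move the card to another slot, and put a known-good card from the chassis into this slot, to determine whether the card or the chassis slot is faulty.

(3)     Check whether the LEDs on the card panel and inside the card are lit.

(4)     If the card is an MPU, service card, or switching fabric card, connect a console cable and check whether the terminal shows normal startup output or startup errors.

(5)     If the card is confirmed faulty, replace it and send the fault information to H3C technical support for analysis.

2. Card state is Fault

(1)     Wait for a while to see whether the card stays in Fault state or reaches Normal and then reboots again. If the card reaches Normal and then reboots automatically, send the fault information to H3C technical support for analysis.

(2)     If the card is an MPU, engine card, or switching fabric card, connect a console cable and check whether the terminal shows normal startup output or startup errors. For example, if an MPU fails the memory read/write test during startup and reboots repeatedly, check that the MPU memory modules are firmly seated.

(3)     Move the card to another slot to confirm whether the card itself is faulty.

(4)     If the card is confirmed faulty, replace it and send the fault information to H3C technical support for analysis.

3. Card reboot

"Card reboot" here means that the card has rebooted at some point but is currently in Normal state.

(1)     Use the logs or the uptime to narrow down when the reboot happened, and check whether a user issued the reboot command or powered the card off and on around that time.

(2)     If all cards rebooted at the same time, check that the power modules are working, whether the external power supply was interrupted, and whether the power cords are firmly connected.

(3)     Check whether the logs around the reboot contain "Slot X need to be rebooted automatically!" or "Note:the operating device is sda0,it's not online ". These messages indicate an abnormal condition. Replace the card, collect the fault information, and send it to H3C technical support for analysis.

(4)     If the cause cannot be determined, collect the fault information and send it to H3C technical support for analysis.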
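Step (3) of the card reboot procedure amounts to searching the logs for two messages. A minimal Python sketch of that search follows; the function name is hypothetical, and the patterns are taken from the message text quoted in step (3), with the slot number left open.

```python
import re

# The two abnormal messages quoted in step (3); "X" is the slot number.
ABNORMAL = [
    re.compile(r"Slot \S+ need to be rebooted automatically!"),
    re.compile(r"the operating device is sda0,\s*it's not online"),
]


def abnormal_reboot_lines(log_text):
    """Return the log lines that contain either abnormal reboot message."""
    return [
        line
        for line in log_text.splitlines()
        if any(p.search(line) for p in ABNORMAL)
    ]
```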

1.2  Fan faults

1.2.1  Fault description

The fan tray LED is abnormal and the device prints fan fault messages, for example:

%Jun 26 10:12:24:805 2013 H3C DEV/3/FAN_ABSENT: -MDC=1; Chassis 2 Fan 2 is absent.

%Jun 26 10:12:32:805 2013 H3C DEVD/2/DRV_DEV_FAN_CHANGE: -MDC=1; Chassis 2: Fan communication state changed: Fan 1 changed to fault.

%Jun 26 10:12:42:405 2013 H3C DEV/2/FAN_FAILED: -MDC=1; Chassis 2 Fan 1 failed.

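1.2.2  Troubleshooting procedure

(1)     With the fan tray installed, place a hand at the air outlet of the device to check for airflow. If there is no airflow, the fan is faulty.

(2)     Check whether the fan air inlet or outlet is blocked or has accumulated too much dust.

(3)     Check that the fan tray is properly in place, that each fan is in a normal state, and whether fan speeds differ by 50% or more. If anything is abnormal, reseat the fan tray or swap it with another one to confirm.

(4)     If the fault persists, the fan tray must be replaced. If no spare fan tray is available, power off the device to prevent cards from burning out due to overheating. If cooling measures can keep the system below 50 degrees Celsius, the device can remain in use temporarily.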

<Sysname> display fan

 Chassis 1:

 Fan Frame 0  State: Normal

 Chassis 2:

 Fan Frame 0  State: Normal
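For a multi-chassis system, the fan state check can be scripted. The Python sketch below parses saved display fan output in the format shown above and returns every fan frame whose state is not Normal. The function name is hypothetical, and because this section shows only the Normal state, any other value is simply treated as abnormal.

```python
import re

CHASSIS = re.compile(r"^\s*Chassis (\d+):")
FAN = re.compile(r"^\s*Fan Frame (\d+)\s+State:\s*(\S+)")


def abnormal_fan_frames(text):
    """Return (chassis, fan_frame, state) for every frame not in Normal state."""
    chassis = None
    abnormal = []
    for line in text.splitlines():
        c = CHASSIS.match(line)
        if c:
            chassis = int(c.group(1))
            continue
        f = FAN.match(line)
        if f and f.group(2) != "Normal":
            abnormal.append((chassis, int(f.group(1)), f.group(2)))
    return abnormal
```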

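1.3  Temperature alarms

1.3.1  Fault description

The device prints alarms indicating that the temperature is too low or too high, for example: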

%Jun 26 10:13:46:233 2013 H3C DEV/4/TEMPERATURE_WARNING: -MDC=1; Temperature is greater than warning upper limit on Chassis 1 slot 2 sensor inflow 1.

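1.3.2  Troubleshooting procedure

(1)     Check whether the ambient temperature is normal. If it is high, find the cause, such as poor ventilation in the equipment room or an air-conditioning failure.

(2)     Check whether the current temperature of the device is outside the Lower, Warning, or Alarm thresholds. You can also touch the card to check whether it is very hot. If the card temperature is high, find the cause immediately. Prolonged operation at high temperature can damage the card.

(3)     If the temperature is too high, see 1.2  Fan faults to determine whether a fan fault is the cause.

(4)     If the temperature value is error or clearly unrealistic, access to the card's temperature sensor over the I2C bus might be failing (display environment command). Transceiver module information is also read over the I2C bus (display transceiver command), so check next whether the card can read transceiver information normally. If transceiver access is normal, use the temperature-limit command to set the card's temperature alarm thresholds again, and use display environment to verify that the setting took effect.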

[Sysname] temperature-limit chassis 2 slot 1 hotspot 1 0 85 90

<Sysname> display environment

 System temperature information (degree centigrade):

 ----------------------------------------------------------------------

 Chassis  Slot  Sensor    Temperature  Lower  Warning  Alarm  Shutdown

 1        0     inflow  1 35           0      48       60     NA

 1        0     hotspot 1 43           0      80       95     NA

 1        1     inflow  1 34           0      48       60     NA

 1        1     hotspot 1 38           0      80       95     NA

 1        2     hotspot 1 49           0      88       100    110

 1        2     hotspot 1 49           0      88       100    110

 1        3     hotspot 1 43           0      80       97     NA

 1        3     hotspot 2 41           0      80       97     NA

 1        4     hotspot 1 42           0      80       97     NA

 1        4     hotspot 2 40           0      80       97     NA

 1        5     hotspot 1 45           0      80       97     NA

 1        5     hotspot 2 41           0      80       97     NA

 1        6     hotspot 1 53           0      88       100    110

 1        6     hotspot 1 53           0      88       100    110

 1        7     hotspot 1 55           0      88       100    110

 1        7     hotspot 1 55           0      88       100    110

 1        8     hotspot 1 67           0      88       100    110

 1        8     hotspot 1 67           0      88       100    110

 1        9     hotspot 1 61           0      88       100    110

 1        9     hotspot 1 61           0      88       100    110

 2        0     inflow  1 34           0      85       90     NA

 2        0     hotspot 1 42           0      85       90     NA

 2        1     inflow  1 36           0      85       90     NA

 2        1     hotspot 1 41           0      85       90     NA

 2        2     hotspot 1 56           0      88       100    110

 2        2     hotspot 1 56           0      88       100    110

 2        3     hotspot 1 47           0      80       97     NA

 2        3     hotspot 2 44           0      80       97     NA

(5)     If the cause still cannot be determined, collect the information and send it to H3C technical support for analysis.
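The threshold comparison in step (2) and the unreadable-value check in step (4) can be sketched in Python as follows. The function names are hypothetical. The sketch assumes the nine-column layout shown above (output from a device without the Chassis column would need a different column count), and it assumes an unreadable sensor appears as a non-numeric token such as error in the Temperature column.

```python
def _to_int(value):
    """Return value as an int, or None for tokens such as 'NA' or 'error'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_environment(text):
    """Return one dict per sensor row of display environment output."""
    keys = ("chassis", "slot", "sensor", "index", "temperature",
            "lower", "warning", "alarm", "shutdown")
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Sensor rows have nine columns and start with chassis and slot numbers.
        if len(parts) == 9 and parts[0].isdigit() and parts[1].isdigit():
            rows.append(dict(zip(keys, parts)))
    return rows


def classify(row):
    """Return 'unreadable', 'alarm', 'warning', 'low', or 'ok' for one row."""
    temp = _to_int(row["temperature"])
    if temp is None:
        return "unreadable"
    lower, warning, alarm = (
        _to_int(row[k]) for k in ("lower", "warning", "alarm")
    )
    if alarm is not None and temp > alarm:
        return "alarm"
    if warning is not None and temp > warning:
        return "warning"
    if lower is not None and temp < lower:
        return "low"
    return "ok"
```

Rows classified as unreadable are the ones to follow up with the I2C and transceiver checks in step (4).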

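1.4  Impact of internal port faults between cards

1.4.1  Fault description

A card reboots unexpectedly or is powered off.

1.4.2  Troubleshooting procedure

1. Unexpected card reboot

Check the diagfile.log log: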

<Sysname> more diagfile/diagfile.log

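HG port fault entries between cards in diagfile.log indicate that an internal port between the cards failed and the card was rebooted by arbitration. If the internal port recovers after the card reboots, send the logs to technical support for analysis.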

%@12527^Dec 19 16:10:56:906 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=10; Chassis 1 Slot 10 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 1: The source port went down.

%@12528^Dec 19 16:10:56:640 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=13; Chassis 1 Slot 13 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 8: The source port went down.

%@12529^Dec 19 16:10:57:376 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=11; Chassis 1 Slot 11 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 3: The source port went down.

%@12530^Dec 19 16:10:56:740 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=12; Chassis 1 Slot 12 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 6: The source port went down.

%@12554^Dec 19 16:11:11:959 2020 M9k DRV/3/FAULT_MONITOR_BITMAP:

Fault PhySlot List: 3

Fault Reason BitMap:

slot    :  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

           -----------------------------------------------------

Fabric1 :  5  5  5  2  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Fabric2 :  5  5  5  2  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Fabric3 :  5  5  5  2  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Fabric4 :  5  5  5  2  5  5  5  5  5  5  5  5  5  5  5  5  5  5

           -----------------------------------------------------

IO board:  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

 

Fault Reason: 0-RFCS, 1-RERPKT, 2-DOWN, 3-UNRESP, 4-1bit, 5-NORMAL

%@12555^Dec 19 16:11:11:960 2020 M9k DRV/3/FAULT_MONITOR_REBOOT: Chassis 1 Slot 3: The card will be restarted due to a hardware failure.

 

Check the logfile.log log:

<Sysname> more logfile/logfile.log

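HG port fault entries between cards in logfile.log indicate that an internal port between the cards failed and the card was rebooted by arbitration. If the internal port recovers after the card reboots, send the logs to technical support for analysis.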

%@4387931%Dec 19 16:10:56:906 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=10; Chassis 1 Slot 10 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 1: The connectivity of the internal port failed.

%@4387932%Dec 19 16:10:56:640 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=13; Chassis 1 Slot 13 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 8: The connectivity of the internal port failed.

%@4387933%Dec 19 16:10:57:376 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=11; Chassis 1 Slot 11 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 3: The connectivity of the internal port failed.

%@4387934%Dec 19 16:10:56:740 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=12; Chassis 1 Slot 12 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 6: The connectivity of the internal port failed.

%@4387947%Dec 19 16:11:11:960 2020 M9k DRV/3/FAULT_MONITOR_REBOOT: Chassis 1 Slot 3: The card will be restarted due to a hardware failure.

%@4387948%Dec 19 16:11:12:151 2020 M9k DEV/2/BOARD_STATE_FAULT: Board state changed to Fault on chassis 1 slot 3, type is NSQM1FWEFGA0.

 

2. Unexpected card power-off

Check the diagfile.log log:

<Sysname> more diagfile/diagfile.log

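If diagfile.log shows that HG port faults between cards caused three reboots within half an hour, followed by the message "The card will be isolated due to a hardware failure", the internal port between the cards has failed and could not be recovered by rebooting, so the card was powered off and isolated by arbitration. In this case, use the logs to identify the cards involved in the HG port fault, replace them, and send the logs to technical support for analysis.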

%@12574^Dec 19 17:15:53:091 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=10; Chassis 1 Slot 10 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 1: The source port went down.

%@12584^Dec 19 17:23:57:002 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=10; Chassis 1 Slot 10 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 1: The source port went down.

%@12605^Dec 19 17:32:34:001 2020 M9k DRV/3/HG_MONITOR_PORT_ERROR: -Chassis=1-Slot=10; Chassis 1 Slot 10 Unit 0 Port 3 to Chassis 1 Slot 3 Unit 0 Port 1: The source port went down.

%@12615^Dec 19 17:32:54:996 2020 M9k DRV/3/FAULT_MONITOR_BITMAP:

Fault PhySlot List: 10

Fault Reason BitMap:

slot    :  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

           -----------------------------------------------------

Fabric1 :  5  5  5  2  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Fabric2 :  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Fabric3 :  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Fabric4 :  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

           -----------------------------------------------------

IO board:  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

 

Fault Reason: 0-RFCS, 1-RERPKT, 2-DOWN, 3-UNRESP, 4-1bit, 5-NORMAL

%@12616^Dec 19 17:32:54:996 2020 M9k DRV/3/FAULT_MONITOR_ISOLATE: Chassis 1 Slot 10: The card will be isolated due to a hardware failure.

 

Check the logfile.log log:

<Sysname> more logfile/logfile.log

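If logfile.log shows that HG port faults between cards caused three reboots within half an hour, followed by the message "The card will be isolated due to a hardware failure", the internal port between the cards has failed and could not be recovered by rebooting, so the card was powered off and isolated by arbitration. In this case, use the logs to identify the cards involved in the HG port fault, replace them, and send the logs to technical support for analysis.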
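To see at a glance which cards were rebooted or isolated and when, the FAULT_MONITOR_REBOOT and FAULT_MONITOR_ISOLATE entries can be extracted from a saved copy of either log. The Python sketch below is based only on the log format of the samples in this section; the function name is hypothetical, and the Chassis field is treated as optional because some entries omit it.

```python
import re
from datetime import datetime

# Matches the timestamp, the action, and the chassis/slot of an arbitration entry.
EVENT = re.compile(
    r"(\w{3} +\d+ \d{2}:\d{2}:\d{2}):\d+ (\d{4}) \S+ "
    r"DRV/3/FAULT_MONITOR_(REBOOT|ISOLATE): (?:Chassis (\d+) )?Slot (\d+):"
)


def fault_monitor_events(log_text):
    """Return (time, action, chassis, slot) for each reboot or isolate entry."""
    events = []
    for m in EVENT.finditer(log_text):
        when = datetime.strptime(
            m.group(1) + " " + m.group(2), "%b %d %H:%M:%S %Y"
        )
        chassis = int(m.group(4)) if m.group(4) else None
        events.append((when, m.group(3), chassis, int(m.group(5))))
    return events
```

Sorting the result by time and grouping by slot shows whether a card went through repeated reboots before an ISOLATE entry.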

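1.5  Checking MPU and service card software versions

On M9000 series high-end firewalls, use display device and display version in any view, and display system internal version in Probe view, to check the software versions of cards that are in a normal state.

1.5.1  Fault description

The MPU software and the service card software must come from the same version branch. Otherwise the MPU and service card versions are inconsistent and the service card enters Fault state. For example, if the MPU runs a 45SP branch version and the service card runs a 32SP branch version, the service card stays in Fault state and cannot start.

1.5.2  Troubleshooting procedure

1. Card in Fault state and unable to start because the MPU and service card use different branches

When a card is in Fault state and cannot start because the MPU and service card use different branches, there are two cases: the card stays in Fault state and keeps rebooting, or the card hangs in Fault state. In both cases, upgrading through boot-loader or the Web interface no longer works. The only fix is to upgrade through the BootWare menu (Ctrl+B). If the card hangs in Fault state, remove and reinsert it to force a reboot, then press Ctrl+B repeatedly so that the card enters the BootWare upgrade menu, and upgrade the software from there.

2. Diagnostic commands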

<Sysname> display boot-loader

Software images on chassis 1 slot 0:

Current software images:

  flash:/M9000-CMW710-BOOT-R9153P22.bin

  flash:/M9000-CMW710-SYSTEM-R9153P22.bin

  flash:/M9000-CMW710-DEVKIT-R9153P22.bin

Main startup software images:

  flash:/M9000-CMW710-BOOT-R9153P22.bin

  flash:/M9000-CMW710-SYSTEM-R9153P22.bin

  flash:/M9000-CMW710-DEVKIT-R9153P22.bin

Backup startup software images:

    None

Software images on chassis 1 slot 1:

Current software images:

  flash:/M9000-CMW710-BOOT-R9153P22.bin

  flash:/M9000-CMW710-SYSTEM-R9153P22.bin

  flash:/M9000-CMW710-DEVKIT-R9153P22.bin

Main startup software images:

  flash:/M9000-CMW710-BOOT-R9153P22.bin

  flash:/M9000-CMW710-SYSTEM-R9153P22.bin

  flash:/M9000-CMW710-DEVKIT-R9153P22.bin

Backup startup software images:

   None

Software images on chassis 1 slot 2.1:

Current software images:

  sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin

Main startup software images:

  sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin

Backup startup software images:

  None

Software images on chassis 1 slot 3.1:

Current software images:

  sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin

Main startup software images:

  sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin

Backup startup software images:

  None

Software images on chassis 2 slot 0:

Current software images:

  flash:/M9000-CMW710-BOOT-R9153P22.bin

  flash:/M9000-CMW710-SYSTEM-R9153P22.bin

  flash:/M9000-CMW710-DEVKIT-R9153P22.bin

Main startup software images:

  flash:/M9000-CMW710-BOOT-R9153P22.bin

  flash:/M9000-CMW710-SYSTEM-R9153P22.bin

  flash:/M9000-CMW710-DEVKIT-R9153P22.bin

Backup startup software images:

  None

Software images on chassis 2 slot 1:

Current software images:

  flash:/M9000-CMW710-BOOT-R9153P22.bin

  flash:/M9000-CMW710-SYSTEM-R9153P22.bin

  flash:/M9000-CMW710-DEVKIT-R9153P22.bin

Main startup software images:

  flash:/M9000-CMW710-BOOT-R9153P22.bin

  flash:/M9000-CMW710-SYSTEM-R9153P22.bin

  flash:/M9000-CMW710-DEVKIT-R9153P22.bin

Backup startup software images:

  None

Software images on chassis 2 slot 2.1:

Current software images:

  sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin

Main startup software images:

  sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin

Backup startup software images:

  None

Software images on chassis 2 slot 3.1:

Current software images:

  sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin

Main startup software images:

  sda0:/BLADE4FWM9000-CMW710-BOOT-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-SYSTEM-R9153P22.bin

  sda0:/BLADE4FWM9000-CMW710-DEVKIT-R9153P22.bin

Backup startup software images:

  None
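A quick consistency check can be run against saved display boot-loader output. The Python sketch below collects the release tag from each file name under "Current software images" and reports whether all slots agree. The function names are hypothetical, and the sketch assumes that the release tag embedded in the file name (such as R9153P22) reflects the image version; display version remains the authoritative source.

```python
import re

SLOT = re.compile(r"^\s*Software images on (.+):\s*$")
RELEASE = re.compile(r"-(R\w+)\.bin\s*$")


def current_releases(text):
    """Map each slot to the set of release tags in its current images."""
    releases = {}
    slot = None
    in_current = False
    for line in text.splitlines():
        s = SLOT.match(line)
        if s:
            slot = s.group(1)
            releases[slot] = set()
            in_current = False
            continue
        stripped = line.strip()
        if stripped.endswith("software images:"):
            # Only the "Current software images:" section is of interest.
            in_current = stripped.startswith("Current")
            continue
        r = RELEASE.search(line)
        if r and slot is not None and in_current:
            releases[slot].add(r.group(1))
    return releases


def is_consistent(releases):
    """True if every image on every slot carries the same release tag."""
    tags = set()
    for slot_tags in releases.values():
        tags |= slot_tags
    return len(tags) <= 1
```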

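1.6  Handling faults detected by GOLD

1.6.1  GOLD diagnostics

GOLD (Generic OnLine Diagnostics) runs diagnostic tests on the device to detect hardware faults and report them. The hardware checks mainly cover ports, memory, chips, connections, forwarding paths, and control paths.

1.6.2  Fault description

HG Monitor monitors the running state of the internal ports (HG/XE) centered on the switching fabric cards. These include the HG ports between fabric cards and interface cards, the XE ports between fabric cards and front-inserted service cards, and the HG ports between fabric cards and MPUs (on M9000 devices).

1.6.3  Troubleshooting procedure

The mechanism detects port-down errors on HG links and tries to recover by rebooting the card first and isolating it afterwards. The basic flow is as follows:

·     HG Monitor is responsible for fault monitoring and reporting. When the HG Monitor on a fabric card detects a fault within a monitoring period (10 seconds), it reports the fault to GOLD on the MPU.

·     GOLD Fault Monitor is responsible for fault arbitration and isolation of the faulty card. The MPU checks for reported faults every 10 seconds. If a fault has been reported, it determines the faulty card based on which cards are present and isolates it. The isolation action consists of three card reboots followed by a power-off: if the fault persists after three reboots, the arbitrated card is powered off and isolated.

1. Locating the fault in logs

Logs related to hg-monitor and fault-monitor arbitration are recorded in the diag log. The syslog also records port-down events. Typical log output:

hg-monitor detects an HG port going down

Syslog: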
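The reboot-then-isolate policy can be illustrated with a small decision function. This Python sketch is an interpretation for illustration, not the device's implementation: the exact counting rule is not specified here, so it takes the "three reboots, then isolation" wording literally, and it borrows the half-hour window from section 1.4.2. The names arbitrate, WINDOW, and MAX_REBOOTS are hypothetical.

```python
from datetime import datetime, timedelta

# Assumptions: the window comes from the "three times within half an hour"
# wording in section 1.4.2; the limit comes from "three card reboots".
WINDOW = timedelta(minutes=30)
MAX_REBOOTS = 3


def arbitrate(previous_reboots, now):
    """Decide the action for a card that has just been reported faulty.

    previous_reboots: datetimes at which this card was already rebooted
    by arbitration. Returns "reboot" or "isolate".
    """
    recent = [t for t in previous_reboots if now - t <= WINDOW]
    return "isolate" if len(recent) >= MAX_REBOOTS else "reboot"
```

Under these assumptions, a card with no recent arbitrated reboots is rebooted, and a card already rebooted three times in the last half hour is isolated.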

DRV/3/HG_MONITOR_BETWEEN_DROP: -Slot=10; Slot 1 Unit 0 Port 33 to Slot 10 Unit 0 Port 16: Packets were dropped in the HG channel.

%Dec 12 15:46:57:103 2020 M9016_1 DRV/3/PORT_MONITOR_DOWN: -Slot=1; Slot 1 Unit 0 Port 33: The internal port went down.

%Dec 12 15:46:56:897 2020 M9016_1 DRV/3/PORT_MONITOR_DOWN: -Slot=10; Slot 10 Unit 0 Port 16: The internal port went down.

%Dec 12 15:46:57:828 2020 M9016_1 DRV/3/FAULT_MONITOR_REBOOT: Slot 10: The card will be restarted due to a hardware failure.

%Dec 12 15:46:58:219 2020 M9016_1 DEV/2/BOARD_STATE_FAULT: Board state changed to Fault on slot 10, type is NSQ1FAB08D0.

 

Diag log:

        %@1211933^Dec 12 15:46:37:402 2020 M9016_1 DRV/3/HG_MONITOR_PORT_ERROR: -Slot=10; Slot 10 Unit 0 Port 16 to Slot 1 Unit 0 Port 33: The source port went down.

%@1211934^Dec 12 15:46:48:046 2020 M9016_1 DRV/3/HG_MONITOR_BETWEEN_DROP: -Slot=10; Slot 1 Unit 0 Port 33 to Slot 10 Unit 0 Port 16: Packets were dropped in the HG channel.

%@1211935^Dec 12 15:46:49:213 2020 M9016_1 DRV/3/HG_MONITOR_PORT_ERROR: -Slot=10; Slot 10 Unit 0 Port 16 to Slot 1 Unit 0 Port 33: The source port went down.

%@1211936^Dec 12 15:46:57:103 2020 M9016_1 DRV/3/PORT_MONITOR_DOWN: -Slot=1; Slot 1 Unit 0 Port 33: The internal port went down.

%@1211937^Dec 12 15:46:56:897 2020 M9016_1 DRV/3/PORT_MONITOR_DOWN: -Slot=10; Slot 10 Unit 0 Port 16: The internal port went down.

%@1211938^Dec 12 15:46:57:827 2020 M9016_1 DRV/3/FAULT_MONITOR_BITMAP:

Fault PhySlot List: 10

Fault Reason BitMap:

slot    :  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

           -----------------------------------------------------

Fabric1 :  5  2  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Fabric2 :  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Fabric3 :  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Fabric4 :  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

           -----------------------------------------------------

IO board:  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

 

Fault Reason: 0-RFCS, 1-RERPKT, 2-DOWN, 3-UNRESP, 4-1bit, 5-NORMAL

 

%@1211939^Dec 12 15:46:57:828 2020 M9016_1 DRV/3/FAULT_MONITOR_REBOOT: Slot 10: The card will be restarted due to a hardware failure

Syslog:

%Dec 12 16:13:34:637 2020 M9016_1 DRV/3/HG_MONITOR_PORT_ERROR: -Slot=10; Slot 10 Unit 0 Port 16 to Slot 1 Unit 0 Port 33: The connectivity of the internal port failed.

%Dec 12 16:13:45:189 2020 M9016_1 DRV/3/HG_MONITOR_BETWEEN_DROP: -Slot=10; Slot 1 Unit 0 Port 33 to Slot 10 Unit 0 Port 16: Packets were dropped in the HG channel.

 

%Dec 12 16:13:54:437 2020 M9016_1 DRV/3/PORT_MONITOR_DOWN: -Slot=1; Slot 1 Unit 0 Port 33: The internal port went down.

%Dec 12 16:13:53:927 2020 M9016_1 DRV/3/PORT_MONITOR_DOWN: -Slot=10; Slot 10 Unit 0 Port 16: The internal port went down.

%Dec 12 16:13:53:959 2020 M9016_1 DRV/3/FAULT_MONITOR_ISOLATE: Slot 10: The card will be isolated due to a hardware failure.

%Dec 12 16:13:53:999 2020 M9016_1 DEV/2/BOARD_STATE_FAULT: Board state changed to Fault on slot 10, type is NSQ1FAB08D0.

 

Diag log:

%@1211957^Dec 12 16:13:53:958 2020 M9016_1 DRV/3/FAULT_MONITOR_BITMAP:

Fault PhySlot List: 10

Fault Reason BitMap:

slot    :  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

           -----------------------------------------------------

Fabric1 :  5  2  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Fabric2 :  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Fabric3 :  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Fabric4 :  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

           -----------------------------------------------------

IO board:  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

 

Fault Reason: 0-RFCS, 1-RERPKT, 2-DOWN, 3-UNRESP, 4-1bit, 5-NORMAL

 

%@1211958^Dec 12 16:13:53:959 2020 M9016_1 DRV/3/FAULT_MONITOR_ISOLATE: Slot 10: The card will be isolated due to a hardware failure.
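The Fault Reason BitMap block is easier to read once the non-NORMAL entries are pulled out. The Python sketch below decodes the numeric codes using the "Fault Reason" legend printed in the log. The function name is hypothetical, and the reading of the table (one row per fabric or IO board entry, one column per slot number) is an assumption based on the header row.

```python
import re

# Legend printed in the log: 0-RFCS, 1-RERPKT, 2-DOWN, 3-UNRESP, 4-1bit, 5-NORMAL.
REASONS = {0: "RFCS", 1: "RERPKT", 2: "DOWN", 3: "UNRESP", 4: "1bit", 5: "NORMAL"}
ROW = re.compile(r"^\s*(slot|Fabric\d+|IO board)\s*:\s*([\d\s]+)$")


def decode_bitmap(text):
    """Return (row_name, slot, reason) for every entry that is not NORMAL."""
    slots = []
    faults = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        values = [int(v) for v in m.group(2).split()]
        if m.group(1) == "slot":
            slots = values  # header row: the slot number of each column
            continue
        for slot, code in zip(slots, values):
            if code != 5:
                faults.append((m.group(1), slot, REASONS.get(code, "unknown")))
    return faults
```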

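2. Resolution

After three arbitrated reboots, the fault persists and the fabric card is powered off and isolated.

Replace the faulty part with a spare.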

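1.6.4  Common diagnostic commands

·     Command: display hgmonitor info X Y

·     Meaning: displays hgmonitor information for a card. X is the slot number and Y is the chip number.

·     Example: view hgmonitor information for chip 0 on the fabric card in slot 11, including fault records for the HG ports that connect chip 0 to interface cards, the XE ports that connect it to service cards, and the HG ports that connect it to MPUs.

·     Command: display hgmonitor judge X

·     Meaning: displays hgmonitor fault statistics for a card. X is the slot number.

·     Example: view hgmonitor fault statistics for the fabric card in slot 13, which show a fault on slot 13 unit 0 port 1.

·     Command: debug gold show monitor fault slot X

·     Meaning: displays the number of cards currently isolated by the MPU. X is the MPU slot number.

·     Example: view the number of isolated cards on the active MPU in slot 5. The maximum number of isolated cards is 1 and the current number is 0.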

 

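1.7  Diagnostic commands

Table 1-1 Diagnostic commands

Command               Description

display device        Displays device information. Use it to check whether each card is in a normal state.

display environment   Displays device temperature information. Use it to check whether the temperature is normal (whether it exceeds the alarm thresholds).

display fan           Displays the operating state of the built-in fans.

display power         Displays power system information, including:
                      ·     Power management enable state
                      ·     Power supply type, rated input voltage, and rated output power
                      ·     State of the installed power modules

display version       Displays system version information, card uptime, and the reason for the last reboot.

save                  Saves the current configuration to the specified file.

temperature-limit     Sets the temperature alarm thresholds of the device.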

 
