02-System Maintenance Troubleshooting Cases

1.1 High CPU usage

1.1.1 Symptom

The device's CPU usage stays above 60%, and the device responds slowly to commands.

<Device> display cpu-usage

Chassis 1 Slot 0 CPU 0 CPU usage:

1% in last 5 seconds

2% in last 1 minute

2% in last 5 minutes

Chassis 1 Slot 4 CPU 0 CPU usage:

1% in last 5 seconds

4% in last 1 minute

4% in last 5 minutes

Chassis 1 Slot 7 CPU 0 CPU usage:

84% in last 5 seconds

27% in last 1 minute

27% in last 5 minutes

Chassis 1 Slot 8 CPU 0 CPU usage:

3% in last 5 seconds

6% in last 1 minute

6% in last 5 minutes

Chassis 1 Slot 9 CPU 0 CPU usage:

3% in last 5 seconds

6% in last 1 minute

6% in last 5 minutes

Chassis 2 Slot 0 CPU 0 CPU usage:

0% in last 5 seconds

2% in last 1 minute

2% in last 5 minutes

Chassis 2 Slot 4 CPU 0 CPU usage:

0% in last 5 seconds

4% in last 1 minute

4% in last 5 minutes

Chassis 2 Slot 6 CPU 0 CPU usage:

3% in last 5 seconds

6% in last 1 minute

6% in last 5 minutes

Chassis 2 Slot 7 CPU 0 CPU usage:

3% in last 5 seconds

6% in last 1 minute

6% in last 5 minutes

Chassis 2 Slot 8 CPU 0 CPU usage:

15% in last 5 seconds

6% in last 1 minute

6% in last 5 minutes

Chassis 2 Slot 9 CPU 0 CPU usage:

3% in last 5 seconds

6% in last 1 minute

6% in last 5 minutes

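The output above was collected in an IRF environment. Chassis 1 and Chassis 2 are member chassis 1 and 2 of the IRF fabric. The command displays the CPU usage of every slot in each chassis.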
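
With many chassis and slots, the busy CPU is easy to miss when reading the output by eye. The following is a minimal Python sketch, not an H3C tool, that parses saved display cpu-usage text and lists the CPUs with any sample above 60%. The function names, the threshold default, and the assumption that the output has the exact line format shown above are illustrative.

```python
import re

# Header of each block, e.g. "Chassis 1 Slot 7 CPU 0 CPU usage:"
HEADER = re.compile(r"Chassis (\d+) Slot (\d+) CPU (\d+) CPU usage:")
# Sample line, e.g. "84% in last 5 seconds"
USAGE_LINE = re.compile(r"(\d+)% in last (\d+ \w+)")


def parse_cpu_usage(text):
    """Return {(chassis, slot, cpu): {"5 seconds": 84, ...}}."""
    usage = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            current = tuple(int(g) for g in header.groups())
            usage[current] = {}
            continue
        sample = USAGE_LINE.match(line)
        if sample and current is not None:
            usage[current][sample.group(2)] = int(sample.group(1))
    return usage


def busy_cpus(usage, threshold=60):
    """List the CPUs that have at least one sample above the threshold."""
    return sorted(key for key, samples in usage.items()
                  if any(value > threshold for value in samples.values()))


# Two blocks copied from the output above.
SAMPLE_OUTPUT = """\
Chassis 1 Slot 0 CPU 0 CPU usage:
1% in last 5 seconds
2% in last 1 minute
2% in last 5 minutes
Chassis 1 Slot 7 CPU 0 CPU usage:
84% in last 5 seconds
27% in last 1 minute
27% in last 5 minutes
"""

if __name__ == "__main__":
    for chassis, slot, cpu in busy_cpus(parse_cpu_usage(SAMPLE_OUTPUT)):
        print(f"chassis {chassis} slot {slot} CPU {cpu} exceeds 60%")
```

In the sample, only chassis 1 slot 7 is reported, and only because of its 5-second value. Its 1-minute and 5-minute averages are 27%, so a stricter check might require all three samples to exceed the threshold before treating the usage as sustained.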

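Use the display cpu-usage history command to view a card's CPU usage over the last 60 minutes. The horizontal axis is the number of minutes before the present. For example, the value at 20 is the CPU usage 20 minutes ago.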

<Device> display cpu-usage history

100%|

95%|

90%|

85%|

80%|

75%|

70%|

65%|

60%|

55%|

50%|

45%|

40%|

35%|

30%|

25%|

20%|

15%|

10%|

5%| #

------------------------------------------------------------

10 20 30 40 50 60 (minutes)

cpu-usage (CPU 0) last 60 minutes (SYSTEM)

1.1.2 Troubleshooting flow

The troubleshooting flow for high CPU usage is as follows.

Figure 1-1 Troubleshooting flowchart for high CPU usage

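1.1.3 Troubleshooting procedure

Common causes of high CPU usage include:

· Route flapping.

· Too many routing policies configured.

· Packet attacks.

· Link loops.

· Packets not being fast forwarded.

· Interfaces not added to a security zone, or no security policy configured, so that large numbers of packets are dropped on the device.

· Debugging commands left enabled.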

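1. Check routing policies

Use the display route-policy command to view the routing policies configured on the device. Check whether too many routing policies are configured, which increases the CPU's processing load.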

<Device> display route-policy

Route-policy: policy1

permit : 1

if-match cost 10

continue: next node 11

apply comm-list a delete
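
To judge whether the routing policy configuration is large, it can help to count policies and nodes rather than scroll through the output. The sketch below is a hypothetical helper that assumes each policy starts with a "Route-policy:" line and each node starts with a "permit : N" or "deny : N" line, as in the sample above. The source gives no number for what counts as too many, so the script only reports counts.

```python
import re

# "Route-policy: policy1" starts a new policy.
POLICY = re.compile(r"Route-policy:\s*(\S+)")
# "permit : 1" (or "deny : 1", assumed) starts a new node in the policy.
NODE = re.compile(r"(permit|deny)\s*:\s*(\d+)")


def count_nodes(text):
    """Return {policy_name: number_of_nodes} for display route-policy text."""
    counts = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        policy = POLICY.match(line)
        if policy:
            current = policy.group(1)
            counts[current] = 0
            continue
        if current is not None and NODE.match(line):
            counts[current] += 1
    return counts


# Copied from the output above.
SAMPLE_OUTPUT = """\
Route-policy: policy1
permit : 1
if-match cost 10
continue: next node 11
apply comm-list a delete
"""

if __name__ == "__main__":
    counts = count_nodes(SAMPLE_OUTPUT)
    print(f"{len(counts)} policies, {sum(counts.values())} nodes in total")
    for name, nodes in counts.items():
        print(f"  {name}: {nodes} node(s)")
```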

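2. Check for link loops

When a link loop exists, the network flaps and large numbers of protocol packets are sent to the CPU for processing, which can also raise CPU usage. Traffic circulates in the loop and broadcast flooding might occur, so the traffic on many ports becomes very heavy and port utilization reaches 90% or more. Use the display interface command to check the utilization of a port:

<Device> display interface Ten-GigabitEthernet6/0/11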

Ten-GigabitEthernet6/0/11

Current state: UP

Line protocol state: UP

Description: Ten-GigabitEthernet6/0/11 Interface

Bandwidth: 10000000 kbps

Maximum transmission unit: 1500

Allow jumbo frames to pass

Broadcast max-ratio: 100%

Multicast max-ratio: 100%

Unicast max-ratio: 100%

Internet protocol processing: Disabled

IP packet frame type: Ethernet II, hardware address: 1234-660e-0012

IPv6 packet frame type: Ethernet II, hardware address: 1234-660e-0012

Media type is optical fiber,Port hardware type is 10G_BASE_SR_SFP

Output queue - Urgent queuing: Size/Length/Discards 0/1024/0

Output queue - Protocol queuing: Size/Length/Discards 0/500/0

Output queue - FIFO queuing: Size/Length/Discards 0/75/0

10Gbps-speed mode, Full-duplex mode

Link speed type is autonegotiation, link duplex type is autonegotiation

Flow-control is not enabled

The Maximum Frame Length is 9216

Last link flapping: 1 hours 31 minutes 7 seconds

Last clearing of counters: 09:48:08 Mon 12/28/2020

Current system time:2020-12-28 11:06:14 Beijing+08:00:00

Last time when physical state changed to up:2020-12-28 09:35:07 Beijing+08:00:00

Last time when physical state changed to down:2020-12-28 09:34:55 Beijing+08:00:00

Peak input rate: 29 bytes/sec, at 2020-12-28 09:54:00

Peak output rate: 373 bytes/sec, at 2020-12-28 10:40:17

Last 300 second input: 0 packets/sec 24 bytes/sec 0%

Last 300 second output: 2 packets/sec 212 bytes/sec 0%

Input (total): 785 packets, 116898 bytes

5 unicasts, 0 broadcasts, 780 multicasts, 0 pauses

Input (normal): 785 packets, - bytes

5 unicasts, 0 broadcasts, 780 multicasts, 0 pauses

Input: 0 input errors, 0 runts, 0 giants, 0 throttles

0 CRC, 0 frame, - overruns, 0 aborts

- ignored, - parity errors

Output (total): 10296 packets, 1119042 bytes

772 unicasts, 0 broadcasts, 9524 multicasts, 0 pauses

Output (normal): 10296 packets, - bytes

772 unicasts, 0 broadcasts, 9524 multicasts, 0 pauses

Output: 0 output errors, - underruns, - buffer failures

0 aborts, 0 deferred, 0 collisions, 0 late collisions

0 lost carrier, - no carrier
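
The utilization figures are in the "Last 300 second input" and "Last 300 second output" lines. The following is a minimal sketch, assuming those lines keep the format shown above, that extracts the two percentages and flags a port at or above the 90% level mentioned in the text. The function names are illustrative.

```python
import re

# e.g. "Last 300 second input: 0 packets/sec 24 bytes/sec 0%"
RATE = re.compile(
    r"Last (\d+) second (input|output):\s+(\d+) packets/sec\s+"
    r"(\d+) bytes/sec\s+(\d+)%"
)


def port_utilization(text):
    """Return {"input": percent, "output": percent} from display interface text."""
    return {m.group(2): int(m.group(5)) for m in RATE.finditer(text)}


def looks_saturated(text, threshold=90):
    """True if the input or output utilization is at or above the threshold."""
    return any(pct >= threshold for pct in port_utilization(text).values())


# The two rate lines copied from the output above.
SAMPLE_OUTPUT = """\
Last 300 second input: 0 packets/sec 24 bytes/sec 0%
Last 300 second output: 2 packets/sec 212 bytes/sec 0%
"""

if __name__ == "__main__":
    print(port_utilization(SAMPLE_OUTPUT))
    print("possible loop" if looks_saturated(SAMPLE_OUTPUT) else "utilization normal")
```

The sample port is at 0% in both directions, so it shows no sign of a loop. High utilization on its own does not prove a loop; the script only narrows down which ports to inspect.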

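If a link loop exists:

· Check that the link connections and port configurations are correct.

· Check whether STP is enabled on the switches connected to the device and whether it is configured correctly.

· Check that the device's routes are configured correctly and that no routing loop exists.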

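3. Check whether packets are fast forwarded

Use the display ip fast-forwarding cache command to determine whether packets are fast forwarded. If the cache has no entry for a flow, the packets of that flow are not fast forwarded.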

<Device> display ip fast-forwarding cache

Total number of fast-forwarding entries: 10

SIP SPort DIP DPort Pro Input_If Output_If Flg

192.168.96.39 162 192.168.210.20 11586 17 M-GE1/0/0/0 InLoop0 1

192.168.96.18 162 192.168.210.20 11585 17 M-GE1/0/0/0 InLoop0 1

192.168.96.16 162 192.168.210.20 11584 17 M-GE1/0/0/0 InLoop0 1

12.1.1.1 3784 12.1.1.2 49216 17 N/A InLoop0 1

192.168.210.20 11585 192.168.96.18 162 17 InLoop0 M-GE1/0/0/0 1

192.168.210.20 11584 192.168.96.16 162 17 InLoop0 M-GE1/0/0/0 1

192.168.210.20 11586 192.168.96.39 162 17 InLoop0 M-GE1/0/0/0 1

12.1.1.2 49216 12.1.1.1 3784 17 InLoop0 N/A 1

192.168.96.40 50356 192.168.210.20 23 6 M-GE1/0/0/0 InLoop0 1

192.168.210.20 23 192.168.96.40 50356 6 InLoop0 M-GE1/0/0/0 1

To check whether packets with a given source or destination IP address are fast forwarded, specify that address in the command:

<Device> display ip fast-forwarding cache 12.1.1.1

Total number of fast-forwarding entries: 2

SIP SPort DIP DPort Pro Input_If Output_If Flg

12.1.1.2 49216 12.1.1.1 3784 17 InLoop0 N/A 1

12.1.1.1 3784 12.1.1.2 49216 17 RAGG5.3101 InLoop0 1
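
When the cache is large, looking up one flow in saved output can be scripted. The sketch below assumes the eight-column layout shown above (SIP, SPort, DIP, DPort, Pro, Input_If, Output_If, Flg). The Entry type and the function names are hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    sip: str
    sport: int
    dip: str
    dport: int
    proto: int
    input_if: str
    output_if: str
    flag: str


def parse_cache(text):
    """Parse display ip fast-forwarding cache text into a list of entries."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 8 columns with numeric ports and protocol; this
        # skips the "Total number ..." line and the column header.
        if (len(fields) == 8 and fields[1].isdigit()
                and fields[3].isdigit() and fields[4].isdigit()):
            entries.append(Entry(fields[0], int(fields[1]), fields[2],
                                 int(fields[3]), int(fields[4]), *fields[5:]))
    return entries


def is_fast_forwarded(entries, sip, dip, proto=None):
    """True if the cache holds an entry for the flow sip -> dip."""
    return any(e.sip == sip and e.dip == dip
               and (proto is None or e.proto == proto)
               for e in entries)


# Copied from the output above.
SAMPLE_OUTPUT = """\
Total number of fast-forwarding entries: 2
SIP SPort DIP DPort Pro Input_If Output_If Flg
12.1.1.2 49216 12.1.1.1 3784 17 InLoop0 N/A 1
12.1.1.1 3784 12.1.1.2 49216 17 RAGG5.3101 InLoop0 1
"""

if __name__ == "__main__":
    cache = parse_cache(SAMPLE_OUTPUT)
    print(is_fast_forwarded(cache, "12.1.1.1", "12.1.1.2", proto=17))
```

A flow is usually checked in both directions, because each direction has its own cache entry.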

4. If the fault persists, send the output of the display cpu-usage command and any other information you have collected to H3C Technical Support for analysis.

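1.2 High memory usage

1.2.1 Symptom

Repeated checks show that a card's memory usage stays high, consistently above 70% (that is, FreeRatio stays below 30%). In the command output, Total is the total memory, Used is the memory currently in use, and FreeRatio is the percentage of memory that is free.

1.2.2 Troubleshooting procedure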

<Device> display memory chassis 1 slot 2

Memory statistics are measured in KB:

Chassis 1 Slot 2:

Total Used Free Shared Buffers Cached FreeRatio

Mem: 984640 313232 671408 0 0 26568 68.2%

-/+ Buffers/Cache: 286664 697976

Swap: 0 0 0

Chassis 1 Slot 2 CPU 1:

Total Used Free Shared Buffers Cached FreeRatio

Mem: 14834944 3342376 11492568 0 600 124500 77.5%

-/+ Buffers/Cache: 3217276 11617668

Swap: 0 0 0
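
The usage percentage is not printed directly. It follows from the Mem line as Used divided by Total, and it is the complement of FreeRatio. The following sketch, which assumes the column order shown above and uses illustrative function names, computes it for each location and lists the locations above 70%.

```python
import re

# e.g. "Chassis 1 Slot 2:" or "Chassis 1 Slot 2 CPU 1:"
LOCATION = re.compile(r"(Chassis \d+ Slot \d+(?: CPU \d+)?):")
# Mem: Total Used Free Shared Buffers Cached FreeRatio
MEM = re.compile(
    r"Mem:\s+(\d+)\s+(\d+)\s+(\d+)\s+\d+\s+\d+\s+\d+\s+([\d.]+)%"
)


def memory_usage(text):
    """Return {location: (used_percent, free_ratio)} from display memory text."""
    stats = {}
    location = None
    for line in text.splitlines():
        line = line.strip()
        loc = LOCATION.fullmatch(line)
        if loc:
            location = loc.group(1)
            continue
        mem = MEM.match(line)
        if mem and location is not None:
            total, used = int(mem.group(1)), int(mem.group(2))
            stats[location] = (round(100 * used / total, 1), float(mem.group(4)))
    return stats


def high_memory(stats, threshold=70):
    """List the locations whose used percentage is above the threshold."""
    return [loc for loc, (used_pct, _) in stats.items() if used_pct > threshold]


# Copied from the output above.
SAMPLE_OUTPUT = """\
Memory statistics are measured in KB:
Chassis 1 Slot 2:
Total Used Free Shared Buffers Cached FreeRatio
Mem: 984640 313232 671408 0 0 26568 68.2%
-/+ Buffers/Cache: 286664 697976
Swap: 0 0 0
Chassis 1 Slot 2 CPU 1:
Total Used Free Shared Buffers Cached FreeRatio
Mem: 14834944 3342376 11492568 0 600 124500 77.5%
-/+ Buffers/Cache: 3217276 11617668
Swap: 0 0 0
"""

if __name__ == "__main__":
    stats = memory_usage(SAMPLE_OUTPUT)
    for loc, (used_pct, free_ratio) in stats.items():
        print(f"{loc}: used {used_pct}%, free {free_ratio}%")
    print("above 70%:", high_memory(stats))
```

In the sample output both locations use well under 70% of their memory, so neither matches the symptom described in 1.2.1.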

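This type of problem is usually caused by a software issue such as a memory leak. It can also result from an excessive number of sessions or routes. Follow the steps below to collect more information and send it to H3C Technical Support for analysis.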

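1. Query the memory usage of each process on the card

Run the display process memory command several times to query the memory usage of each process on the card. Dynamic memory is allocated by the device at run time and grows large when a leak occurs. Compare successive outputs to identify any process whose memory usage keeps increasing. A steady increase suggests that the process might be leaking memory; record its JID. The following uses the diagd process, whose JID is 78, as an example.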

<Device> display process memory chassis 2 slot 2

JID Text Data Stack Dynamic Name

1 168 604 24 64 scmd

2 0 0 0 0 [kthreadd]

3 0 0 0 0 [ksoftirqd/0]

……

78 112 9368 12 320 diagd

79 76 1040 8 8 mdcagentd

80 116 8860 8 16 fsd

81 140 992 16 212 dbmd

83 72 496 8 20 syslogd

84 168 41980 16 44 drvdiagd

85 172 17112 16 12 devd

94 112 8864 12 12 edev

……
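
Comparing several outputs by hand is error-prone, so the comparison can be sketched in Python. The code below assumes the six-column layout shown above (JID, Text, Data, Stack, Dynamic, Name) and reports the processes whose Dynamic value rose in every interval between snapshots. The first snapshot reuses two rows from the output above; the later snapshots contain made-up values for illustration only.

```python
def parse_process_memory(text):
    """Return {jid: (name, dynamic)} from display process memory text."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows: five numeric columns followed by the process name.
        if len(fields) == 6 and all(f.isdigit() for f in fields[:5]):
            table[int(fields[0])] = (fields[5], int(fields[4]))
    return table


def growing_processes(snapshots):
    """Return (jid, name, dynamic_values) for processes whose Dynamic
    memory increased between every pair of consecutive snapshots."""
    tables = [parse_process_memory(s) for s in snapshots]
    if len(tables) < 2:
        return []  # nothing to compare
    suspects = []
    for jid, (name, _) in tables[0].items():
        if not all(jid in table for table in tables):
            continue  # process not present in every snapshot
        values = [table[jid][1] for table in tables]
        if all(later > earlier for earlier, later in zip(values, values[1:])):
            suspects.append((jid, name, values))
    return suspects


HEADER = "JID Text Data Stack Dynamic Name\n"
SNAPSHOT_1 = HEADER + "78 112 9368 12 320 diagd\n79 76 1040 8 8 mdcagentd\n"
# Hypothetical later snapshots: diagd keeps growing, mdcagentd does not.
SNAPSHOT_2 = HEADER + "78 112 9368 12 352 diagd\n79 76 1040 8 8 mdcagentd\n"
SNAPSHOT_3 = HEADER + "78 112 9368 12 400 diagd\n79 76 1040 8 8 mdcagentd\n"

if __name__ == "__main__":
    for jid, name, values in growing_processes([SNAPSHOT_1, SNAPSHOT_2, SNAPSHOT_3]):
        print(f"JID {jid} ({name}): Dynamic {values}")
```

Growth across a few snapshots is only a hint. A process that is legitimately handling more sessions or routes also grows, so the result indicates which JID to examine in the next step rather than confirming a leak.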

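2. Identify the size of the memory blocks being leaked

Next, determine which memory block size is leaking in the diagd process (JID 78). In the output below, Size is the block size in bytes, Total is the number of blocks allocated, Used is the number of blocks in use, Free is the number of unused blocks, and Free Ratio is the percentage of blocks that are unused. Run the command several times and compare the results to see which block size has a steadily increasing Used count. Then send the collected information to H3C Technical Support for analysis.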

<Device> display process memory heap job 78 verbose

Heap usage:

Size Free Used Total Free Ratio

16 0 385 385 0.0%

24 2 49 51 3.9%

32 0 13 13 0.0%

40 0 7 7 0.0%

64 0 411 411 0.0%

72 0 4 4 0.0%

80 1 0 1 100.0%

96 1 0 1 100.0%

104 0 8 8 0.0%

136 0 8 8 0.0%

152 0 9 9 0.0%

184 0 1 1 0.0%

368 0 8 8 0.0%

3080 0 1 1 0.0%

8200 1 0 1 100.0%

29376 1 0 1 100.0%

Large Memory Usage:

Used Blocks : 24

Used Memory(in bytes): 2031616

Free Blocks : 0

Free Memory(in bytes): 0

Summary:

Total virtual memory heap space(in bytes) : 2113536

Total physical memory heap space(in bytes) : 454656

Total allocated memory(in bytes) : 2075736
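
The same before-and-after comparison applies to the heap table. The following sketch assumes the row format shown above (Size, Free, Used, Total, Free Ratio) and reports, for each block size, how much the Used count grew between two saved outputs. The "after" values are invented for the example.

```python
def parse_heap(text):
    """Return {block_size: used_count} from the Heap usage table."""
    used = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows: four numeric columns and a percentage, e.g. "16 0 385 385 0.0%"
        if (len(fields) == 5 and all(f.isdigit() for f in fields[:4])
                and fields[4].endswith("%")):
            size, _free, in_use, _total = map(int, fields[:4])
            used[size] = in_use
    return used


def used_growth(before, after):
    """Return {block_size: increase_in_used} for sizes whose Used count grew."""
    old, new = parse_heap(before), parse_heap(after)
    return {size: new[size] - old.get(size, 0)
            for size in sorted(new)
            if new[size] > old.get(size, 0)}


# Two rows copied from the output above.
BEFORE = """\
Size Free Used Total Free Ratio
16 0 385 385 0.0%
64 0 411 411 0.0%
"""
# Hypothetical later output in which only the 64-byte blocks have grown.
AFTER = """\
Size Free Used Total Free Ratio
16 0 385 385 0.0%
64 0 530 530 0.0%
"""

if __name__ == "__main__":
    for size, delta in used_growth(BEFORE, AFTER).items():
        print(f"{size}-byte blocks: Used increased by {delta}")
```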

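1.3 Troubleshooting commands

Table 1-1 Troubleshooting commands

Command	Description
display cpu-usage	Displays CPU usage statistics. Use it to find where CPU usage is high.
display cpu-usage history	Displays historical CPU usage statistics as a graph.
display interface	Displays information about the specified interface. Use it to check whether the interface traffic is normal.
display memory	Displays the memory usage of a card.
display process memory	Displays the memory usage of each process on a card. Run it several times to find processes that might be leaking memory.
display process memory heap	Displays detailed information about Dynamic memory. Use it to identify which memory block size is leaking.
display route-policy	Displays the routing policies configured on the device. Use it to check whether too many routing policies are configured, which increases the CPU's processing load.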
