02-System Maintenance Troubleshooting Cases
CPU usage on the device stays above 60%, and the device responds slowly when commands are issued.
<Device> display cpu-usage
Chassis 1 Slot 0 CPU 0 CPU usage:
1% in last 5 seconds
2% in last 1 minute
2% in last 5 minutes
Chassis 1 Slot 4 CPU 0 CPU usage:
1% in last 5 seconds
4% in last 1 minute
4% in last 5 minutes
Chassis 1 Slot 7 CPU 0 CPU usage:
84% in last 5 seconds
27% in last 1 minute
27% in last 5 minutes
Chassis 1 Slot 8 CPU 0 CPU usage:
3% in last 5 seconds
6% in last 1 minute
6% in last 5 minutes
Chassis 1 Slot 9 CPU 0 CPU usage:
3% in last 5 seconds
6% in last 1 minute
6% in last 5 minutes
Chassis 2 Slot 0 CPU 0 CPU usage:
0% in last 5 seconds
2% in last 1 minute
2% in last 5 minutes
Chassis 2 Slot 4 CPU 0 CPU usage:
0% in last 5 seconds
4% in last 1 minute
4% in last 5 minutes
Chassis 2 Slot 6 CPU 0 CPU usage:
3% in last 5 seconds
6% in last 1 minute
6% in last 5 minutes
Chassis 2 Slot 7 CPU 0 CPU usage:
3% in last 5 seconds
6% in last 1 minute
6% in last 5 minutes
Chassis 2 Slot 8 CPU 0 CPU usage:
15% in last 5 seconds
6% in last 1 minute
6% in last 5 minutes
Chassis 2 Slot 9 CPU 0 CPU usage:
3% in last 5 seconds
6% in last 1 minute
6% in last 5 minutes
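This output shows CPU usage queried in an IRF environment. Chassis 1 and Chassis 2 are member chassis 1 and member chassis 2 of the IRF fabric. The command displays the CPU usage of every slot in every chassis.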
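On a fabric with many slots it can help to scan this output automatically. The following is a minimal Python sketch, not an H3C tool: it assumes the output has been captured as text, and the function names `parse_cpu_usage` and `busy_cpus` are made up for illustration. It flags a CPU if any sampling window exceeds the 60% threshold mentioned above, which is a simplification of "stays above 60%".

```python
import re

def parse_cpu_usage(text):
    """Map (chassis, slot, cpu) to {window: percent} from display cpu-usage output."""
    usage = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"Chassis (\d+) Slot (\d+) CPU (\d+) CPU usage:", line)
        if header:
            current = tuple(int(g) for g in header.groups())
            usage[current] = {}
            continue
        sample = re.match(r"(\d+)% in last (\d+ (?:seconds?|minutes?))", line)
        if sample and current is not None:
            usage[current][sample.group(2)] = int(sample.group(1))
    return usage

def busy_cpus(usage, threshold=60):
    """List the CPUs whose usage exceeds threshold in at least one window."""
    return sorted(key for key, windows in usage.items()
                  if any(value > threshold for value in windows.values()))

# Two entries copied from the output above.
sample_output = """\
Chassis 1 Slot 0 CPU 0 CPU usage:
1% in last 5 seconds
2% in last 1 minute
2% in last 5 minutes
Chassis 1 Slot 7 CPU 0 CPU usage:
84% in last 5 seconds
27% in last 1 minute
27% in last 5 minutes
"""
print(busy_cpus(parse_cpu_usage(sample_output)))  # [(1, 7, 0)]
```

Here only Chassis 1 Slot 7 is reported, because its 5-second value (84%) is above the threshold. A stricter check would require every window to exceed it.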
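Use display cpu-usage history to view a card's CPU usage over the last 60 minutes. For example, the value at 20 on the horizontal (time) axis is the CPU usage 20 minutes ago.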
<Device> display cpu-usage history
100%|
95%|
90%|
85%|
80%|
75%|
70%|
65%|
60%|
55%|
50%|
45%|
40%|
35%|
30%|
25%|
20%|
15%|
10%|
5%| #
------------------------------------------------------------
10 20 30 40 50 60 (minutes)
cpu-usage (CPU 0) last 60 minutes (SYSTEM)
Troubleshoot high CPU usage as follows:
Figure 1-1 Troubleshooting flowchart for high CPU usage
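Common causes of high CPU usage include:
· Route flapping.
· Too many routing policies are configured.
· Packet attacks.
· Link loops.
· Packets are not fast forwarded.
· Interfaces are not assigned to a security zone or have no security policy, so the device drops a large number of packets.
· Debugging is enabled.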
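Use the display route-policy command to view the routing policies configured on the device. Check whether so many routing policies are configured that they add to the CPU's processing load.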
<Device> display route-policy
Route-policy: policy1
permit : 1
if-match cost 10
continue: next node 11
apply comm-list a delete
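When a link loop forms, the network flaps, and the large number of protocol packets sent to the CPU for processing can also drive CPU usage up. With a loop, traffic circulates and broadcast traffic may build up. Traffic on many ports becomes very heavy, and port utilization reaches 90% or more: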
<Device>display interface Ten-GigabitEthernet6/0/11
Ten-GigabitEthernet6/0/11
Current state: UP
Line protocol state: UP
Description: Ten-GigabitEthernet6/0/11 Interface
Bandwidth: 10000000 kbps
Maximum transmission unit: 1500
Allow jumbo frames to pass
Broadcast max-ratio: 100%
Multicast max-ratio: 100%
Unicast max-ratio: 100%
Internet protocol processing: Disabled
IP packet frame type: Ethernet II, hardware address: 1234-660e-0012
IPv6 packet frame type: Ethernet II, hardware address: 1234-660e-0012
Media type is optical fiber,Port hardware type is 10G_BASE_SR_SFP
Output queue - Urgent queuing: Size/Length/Discards 0/1024/0
Output queue - Protocol queuing: Size/Length/Discards 0/500/0
Output queue - FIFO queuing: Size/Length/Discards 0/75/0
10Gbps-speed mode, Full-duplex mode
Link speed type is autonegotiation, link duplex type is autonegotiation
Flow-control is not enabled
The Maximum Frame Length is 9216
Last link flapping: 1 hours 31 minutes 7 seconds
Last clearing of counters: 09:48:08 Mon 12/28/2020
Current system time:2020-12-28 11:06:14 Beijing+08:00:00
Last time when physical state changed to up:2020-12-28 09:35:07 Beijing+08:00:00
Last time when physical state changed to down:2020-12-28 09:34:55 Beijing+08:00:00
Peak input rate: 29 bytes/sec, at 2020-12-28 09:54:00
Peak output rate: 373 bytes/sec, at 2020-12-28 10:40:17
Last 300 second input: 0 packets/sec 24 bytes/sec 0%
Last 300 second output: 2 packets/sec 212 bytes/sec 0%
Input (total): 785 packets, 116898 bytes
5 unicasts, 0 broadcasts, 780 multicasts, 0 pauses
Input (normal): 785 packets, - bytes
5 unicasts, 0 broadcasts, 780 multicasts, 0 pauses
Input: 0 input errors, 0 runts, 0 giants, 0 throttles
0 CRC, 0 frame, - overruns, 0 aborts
- ignored, - parity errors
Output (total): 10296 packets, 1119042 bytes
772 unicasts, 0 broadcasts, 9524 multicasts, 0 pauses
Output (normal): 10296 packets, - bytes
772 unicasts, 0 broadcasts, 9524 multicasts, 0 pauses
Output: 0 output errors, - underruns, - buffer failures
0 aborts, 0 deferred, 0 collisions, 0 late collisions
0 lost carrier, - no carrier
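If a link loop occurs:
· Check that the link connections and port configurations are correct.
· Check whether STP is enabled on the switch connected to the device and whether it is configured correctly.
· Check that the routes on the device are configured correctly and that no routing loop exists.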
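The 90% port-utilization figure can be checked by hand from the Bandwidth line and the "Last 300 second" rate lines of display interface. The sketch below shows the arithmetic in Python. It is an illustration under stated assumptions: `interface_utilization` is a hypothetical helper, 1 kbps is taken as 1000 bit/s, per-frame overhead (preamble, inter-frame gap) is ignored, and the high-rate sample lines are invented rather than taken from a real device.

```python
import re

def interface_utilization(text):
    """Return input/output load as a percentage of the interface bandwidth."""
    bandwidth = re.search(r"Bandwidth:\s*(\d+)\s*kbps", text)
    if bandwidth is None:
        raise ValueError("no Bandwidth line found")
    capacity_bps = int(bandwidth.group(1)) * 1000  # assumes 1 kbps = 1000 bit/s
    load = {}
    for direction in ("input", "output"):
        rate = re.search(
            r"Last \d+ second " + direction + r":\s*\d+ packets/sec\s+(\d+) bytes/sec",
            text,
        )
        if rate is None:
            raise ValueError("no rate line found for " + direction)
        # bytes/sec -> bit/s, then express as a percentage of capacity.
        load[direction] = int(rate.group(1)) * 8 * 100 / capacity_bps
    return load

# Hypothetical rates for a looped 10G port; not taken from the output above.
looped_port = """\
Bandwidth: 10000000 kbps
Last 300 second input: 850000 packets/sec 1150000000 bytes/sec 92%
Last 300 second output: 850000 packets/sec 1150000000 bytes/sec 92%
"""
for direction, percent in interface_utilization(looped_port).items():
    print(f"{direction}: {percent:.1f}% of line rate")
```

Applied to the real output shown earlier (24 and 212 bytes/sec on a 10 Gbps port), the same calculation gives values that round to 0%, matching the percentages the device prints.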
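Use the display ip fast-forwarding cache command to determine whether packets are fast forwarded. If a flow has no entry in the cache, its packets are not being fast forwarded.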
<Device> display ip fast-forwarding cache
Total number of fast-forwarding entries: 10
SIP SPort DIP DPort Pro Input_If Output_If Flg
192.168.96.39 162 192.168.210.20 11586 17 M-GE1/0/0/0 InLoop0 1
192.168.96.18 162 192.168.210.20 11585 17 M-GE1/0/0/0 InLoop0 1
192.168.96.16 162 192.168.210.20 11584 17 M-GE1/0/0/0 InLoop0 1
12.1.1.1 3784 12.1.1.2 49216 17 N/A InLoop0 1
192.168.210.20 11585 192.168.96.18 162 17 InLoop0 M-GE1/0/0/0 1
192.168.210.20 11584 192.168.96.16 162 17 InLoop0 M-GE1/0/0/0 1
192.168.210.20 11586 192.168.96.39 162 17 InLoop0 M-GE1/0/0/0 1
12.1.1.2 49216 12.1.1.1 3784 17 InLoop0 N/A 1
192.168.96.40 50356 192.168.210.20 23 6 M-GE1/0/0/0 InLoop0 1
192.168.210.20 23 192.168.96.40 50356 6 InLoop0 M-GE1/0/0/0 1
To check whether packets with a given source or destination IP address are fast forwarded, specify that address in the command:
<Device> display ip fast-forwarding cache 12.1.1.1
Total number of fast-forwarding entries: 2
SIP SPort DIP DPort Pro Input_If Output_If Flg
12.1.1.2 49216 12.1.1.1 3784 17 InLoop0 N/A 1
12.1.1.1 3784 12.1.1.2 49216 17 RAGG5.3101 InLoop0 1
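When many flows need checking, the cache lookup can be scripted against captured output. This is a rough Python sketch only: the helper names `parse_ff_cache` and `is_fast_forwarded` are assumptions, and the parser relies on the eight-column layout shown above (SIP, SPort, DIP, DPort, Pro, Input_If, Output_If, Flg).

```python
COLUMNS = ["SIP", "SPort", "DIP", "DPort", "Pro", "Input_If", "Output_If", "Flg"]

def parse_ff_cache(text):
    """Parse display ip fast-forwarding cache rows into dictionaries."""
    flows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly eight fields and start with a dotted IPv4 address;
        # this skips the "Total number..." line and the column header.
        if len(fields) == len(COLUMNS) and fields[0].count(".") == 3:
            flows.append(dict(zip(COLUMNS, fields)))
    return flows

def is_fast_forwarded(flows, sip, dip, protocol=None):
    """True if a cache entry exists for the flow sip -> dip (optionally a given protocol)."""
    return any(
        flow["SIP"] == sip and flow["DIP"] == dip
        and (protocol is None or flow["Pro"] == str(protocol))
        for flow in flows
    )

# Output copied from the filtered query above.
cache_output = """\
Total number of fast-forwarding entries: 2
SIP SPort DIP DPort Pro Input_If Output_If Flg
12.1.1.2 49216 12.1.1.1 3784 17 InLoop0 N/A 1
12.1.1.1 3784 12.1.1.2 49216 17 RAGG5.3101 InLoop0 1
"""
flows = parse_ff_cache(cache_output)
print(is_fast_forwarded(flows, "12.1.1.1", "12.1.1.2", protocol=17))  # True
print(is_fast_forwarded(flows, "12.1.1.1", "10.0.0.1"))               # False
```

A flow that returns False has no cache entry, which per the text above means its packets are not taking the fast-forwarding path.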
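Repeated checks of a card's memory usage show that it stays high, consistently above 70% (that is, the free-memory ratio is below 30%). Total is the total memory, Used is the memory currently in use, and FreeRatio is the percentage of memory that is free.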
<Device> display memory chassis 1 slot 2
Memory statistics are measured in KB:
Chassis 1 Slot 2:
Total Used Free Shared Buffers Cached FreeRatio
Mem: 984640 313232 671408 0 0 26568 68.2%
-/+ Buffers/Cache: 286664 697976
Swap: 0 0 0
Chassis 1 Slot 2 CPU 1:
Total Used Free Shared Buffers Cached FreeRatio
Mem: 14834944 3342376 11492568 0 600 124500 77.5%
-/+ Buffers/Cache: 3217276 11617668
Swap: 0 0 0
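FreeRatio is Free divided by Total, so the 30% line is easy to verify from the Mem: rows. The Python sketch below recomputes it from captured output. The function name `free_ratios` and the `LOW_FREE_PCT` constant are illustrative assumptions, not part of any H3C tooling.

```python
import re

LOW_FREE_PCT = 30.0  # threshold taken from the text: free ratio below 30%

def free_ratios(text):
    """Return (total_kb, used_kb, free_kb, free_pct) for each Mem: row."""
    rows = []
    for line in text.splitlines():
        match = re.match(r"\s*Mem:\s+(\d+)\s+(\d+)\s+(\d+)", line)
        if match:
            total, used, free = (int(g) for g in match.groups())
            rows.append((total, used, free, round(free * 100 / total, 1)))
    return rows

# The two Mem: rows copied from the output above.
memory_output = """\
Mem: 984640 313232 671408 0 0 26568 68.2%
Mem: 14834944 3342376 11492568 0 600 124500 77.5%
"""
for total, used, free, free_pct in free_ratios(memory_output):
    state = "LOW" if free_pct < LOW_FREE_PCT else "ok"
    print(f"total={total} KB free={free} KB free_ratio={free_pct}% {state}")
```

For the sample output the computed ratios are 68.2% and 77.5%, the same values the device prints, and both are well above the 30% line.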
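Problems of this kind are usually caused by software issues such as memory leaks, but they can also result from too many sessions or routes. Follow the steps below to collect more information, and send it to H3C technical support for analysis.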
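Run the display process memory command several times to query the memory usage of each process on the card. Dynamic memory is memory that the device allocates dynamically, and it grows very large when a leak occurs. Comparing successive outputs shows which process's memory usage keeps increasing. If it keeps increasing, that process is probably leaking memory; record its JID. The following uses the diagd process, whose JID is 78, as an example.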
<Device> display process memory chassis 2 slot 2
JID Text Data Stack Dynamic Name
1 168 604 24 64 scmd
2 0 0 0 0 [kthreadd]
3 0 0 0 0 [ksoftirqd/0]
……
78 112 9368 12 320 diagd
79 76 1040 8 8 mdcagentd
80 116 8860 8 16 fsd
81 140 992 16 212 dbmd
83 72 496 8 20 syslogd
84 168 41980 16 44 drvdiagd
85 172 17112 16 12 devd
94 112 8864 12 12 edev
……
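The before-and-after comparison of the Dynamic column can be automated once several outputs are saved. The following Python sketch illustrates one way to do it. The helper names are hypothetical, the first snapshot reuses two rows from the output above, and the later snapshots are invented purely to show the comparison.

```python
def parse_process_memory(text):
    """Map JID to (name, dynamic) from display process memory output."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows: JID Text Data Stack Dynamic Name, with five numeric columns.
        if len(fields) == 6 and all(f.isdigit() for f in fields[:5]):
            table[int(fields[0])] = (fields[5], int(fields[4]))
    return table

def growing_processes(snapshots):
    """Return (jid, name, values) where Dynamic never drops and ends higher than it began."""
    first, rest = snapshots[0], snapshots[1:]
    suspects = []
    for jid, (name, _) in first.items():
        if not all(jid in snap for snap in rest):
            continue  # process not present in every snapshot
        values = [snap[jid][1] for snap in snapshots]
        steady = all(a <= b for a, b in zip(values, values[1:]))
        if steady and values[-1] > values[0]:
            suspects.append((jid, name, values))
    return suspects

first = parse_process_memory("""\
78 112 9368 12 320 diagd
79 76 1040 8 8 mdcagentd
""")
# Later snapshots are invented for illustration only.
second = parse_process_memory("""\
78 112 9368 12 352 diagd
79 76 1040 8 8 mdcagentd
""")
third = parse_process_memory("""\
78 112 9368 12 400 diagd
79 76 1040 8 8 mdcagentd
""")
print(growing_processes([first, second, third]))  # [(78, 'diagd', [320, 352, 400])]
```

A process that appears in this list is only a candidate: normal workload growth also raises Dynamic, so the interval between snapshots and the traffic conditions matter when reading the result.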
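Next, determine which memory block size is leaking in the diagd process (JID 78). In the output of the following command, Size is the block size in bytes, Total is the total number of blocks allocated, Used is the number of blocks in use, Free is the number of blocks not in use, and Free Ratio is the percentage of blocks not in use. Querying several times and comparing the values shows which block size has a steadily increasing Used count. When you have finished, send the collected information to H3C technical support for analysis.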
<Device> display process memory heap job 78 verbose
Heap usage:
Size Free Used Total Free Ratio
16 0 385 385 0.0%
24 2 49 51 3.9%
32 0 13 13 0.0%
40 0 7 7 0.0%
64 0 411 411 0.0%
72 0 4 4 0.0%
80 1 0 1 100.0%
96 1 0 1 100.0%
104 0 8 8 0.0%
136 0 8 8 0.0%
152 0 9 9 0.0%
184 0 1 1 0.0%
368 0 8 8 0.0%
3080 0 1 1 0.0%
8200 1 0 1 100.0%
29376 1 0 1 100.0%
Large Memory Usage:
Used Blocks : 24
Used Memory(in bytes): 2031616
Free Blocks : 0
Free Memory(in bytes): 0
Summary:
Total virtual memory heap space(in bytes) : 2113536
Total physical memory heap space(in bytes) : 454656
Total allocated memory(in bytes) : 2075736
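Comparing the Used column across two heap queries can be sketched the same way. This Python example is an illustration only: `parse_heap_usage` and `growing_block_sizes` are hypothetical names, the "before" rows are copied from the output above, and the "after" rows are invented to show what a growing block size would look like.

```python
def parse_heap_usage(text):
    """Map block Size (bytes) to its Used count from the Heap usage table."""
    used = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like "16 0 385 385 0.0%": Size, Free, Used, Total, Free Ratio.
        if (len(fields) == 5 and fields[4].endswith("%")
                and all(f.isdigit() for f in fields[:4])):
            used[int(fields[0])] = int(fields[2])
    return used

def growing_block_sizes(before, after):
    """Return {size: (used_before, used_after)} for sizes whose Used count increased."""
    return {size: (before.get(size, 0), count)
            for size, count in after.items()
            if count > before.get(size, 0)}

before = parse_heap_usage("""\
Size Free Used Total Free Ratio
16 0 385 385 0.0%
24 2 49 51 3.9%
64 0 411 411 0.0%
""")
# Invented later query: only the 64-byte blocks have grown.
after = parse_heap_usage("""\
Size Free Used Total Free Ratio
16 0 385 385 0.0%
24 2 49 51 3.9%
64 0 530 530 0.0%
""")
print(growing_block_sizes(before, after))  # {64: (411, 530)}
```

One comparison is weak evidence; repeating the query several times, as the text recommends, shows whether the increase is sustained.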
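Table 1-1 Troubleshooting commands
Command | Description
display cpu-usage | Displays CPU usage statistics; used when looking for the source of high CPU usage
display cpu-usage history | Displays historical CPU usage statistics as a chart
display interface | Displays information about the specified interface; used to check whether traffic on the interface is normal
display memory | Displays the memory usage of a card
display process memory | Displays the memory usage of each process on a card; repeated queries reveal processes that may be leaking memory
display process memory heap | Displays details of Dynamic memory; used to determine which memory block size is leaking
display route-policy | Displays the routing policies configured on the device; used to check whether too many routing policies are adding to the CPU's processing load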