Table of Contents

Related Documents

01-Hardware Resource Management Troubleshooting Guide

Title	Size	Download
01-Hardware Resource Management Troubleshooting Guide	118.60 KB

Troubleshooting system management

Hardware resource management issues

High CPU usage

Symptom

If one of the following conditions occurs, the CPU control core usage of the device is high, and you must identify the causes for the high CPU usage:

· During daily inspection of the device, execute the display cpu-usage command repeatedly to view the CPU usage. The CPU usage is significantly higher than the daily average.

# Execute the display cpu-usage summary command to view the average CPU usage during the most recent 5-second, 1-minute, or 5-minute interval.

<Sysname> display cpu-usage summary

Slot CPU Last 5 sec Last 1 min Last 5 min

1 0 5% 5% 4%

# Execute the display cpu-usage history command to view the CPU usage in graphical form for the last 60 samples. The data shows that the CPU usage rate continues to increase or is significantly higher than the daily average value.

· When you log in to the device via Telnet or SSH and execute commands, the device responds slowly and experiences stagnation.

· The device outputs log messages about high CPU usage on the device.

· Alarms on high CPU usage occur on the SNMP manager.

Common causes

The following are the common causes of this type of issue:

· Network attacks.

· Protocol flappings, typically route protocol flappings.

· After flow sampling is configured on the device and the traffic to be processed is too large or the device's sampling frequency is too high, the sampling feature occupies a significant amount of CPU resources.

· The device generates a large number of log messages. Then, abundant resources are occupied for the generation and management of these log messages.

Troubleshooting flow

Figure 1 shows the troubleshooting flowchart.

Figure 1 Flowchart for troubleshooting high CPU usage

‌

Solution

1. Identify whether a network attack occurs.

On a live network, the most common cause of high CPU usage is a network attack. Attackers initiate a large number of abnormal network interactions which hit the device. For example, the attackers transmit a large number request messages for setting up TCP connections or ICMP request messages in a short period. Then, the device is busy processing these attack messages, leading to high CPU usage and subsequently affecting the normal operation of the device.

In probe view, execute the display system internal control-plane statistics command to view statistics of the control plane to check the number of dropped messages. If the current CPU usage is high and the Dropped field value is large, a message attack occurs probably on the device.

<Sysname> display system internal control-plane statistics slot 1

Control plane slot 1

Protocol: Default

Bandwidth: 15360 (pps)

Forwarded: 108926 (Packets), 29780155 (Bytes)

Dropped : 0 (Packets), 0 (Bytes)

Protocol: ARP

Bandwidth: 512 (pps)

Forwarded: 1489284 (Packets), 55318920 (Bytes)

Dropped : 122114 (Packets), 491421 (Bytes)

...

¡ If a network attack occurs, first resolve the network attack issue.

¡ If no network attack occurs, proceed to step 2.

2. Identify whether a protocol flapping occurs on the device.

A protocol flapping can cause continuous processing of protocol messages, topology calculations, and entry updates by the device, resulting in high CPU usage. In practical applications, the most common protocol flappings are OSPF protocol flappings.

¡ For OSPF flappings, execute the display ip routing-table command to view routing information. If route entries for the same network segment are frequently and repeatedly created and deleted in the routing table, a route flapping occurs.

- If a route flapping occurs or the routes do not exist, troubleshoot link-related issues and IGP routing issues.

- If no route flapping occurs, proceed to step 3.

3. Identify whether flow statistics and sampling features are configured and whether the configured parameters are appropriate.

After network traffic monitoring features including NetStream and sFlow are configured on the device, the device will statistically analyze network traffic. If network traffic is high, the CPU usage might be high. In this case, perform the following operations:

¡ Configure filter conditions to precisely filter the traffic, and only analyze the traffic that users care about.

¡ Configure the sampler and adjust the sampling ratio. Then, the statistics collected by NetStream and sFlow can basically reflect the status of the entire network, and can prevent the excessive statistical messages from affecting the forwarding performance of the device.

4. Identify whether the device is generating a large number of log messages.

In certain anomaly situations, for example, when the device is under attack, errors occur during the operation, or a port frequently runs up/down, the device continuously generates diagnostic information or log information. In this case, the system software needs to frequently read and write to the memory, which can increase the CPU usage.

Use the following methods to identify whether the device is generating a large number of log messages:

¡ Log in to the device via Telnet and execute the terminal monitor command to enable log output to the current terminal.

<Sysname> terminal monitor

The current terminal is enabled to display logs.

After you execute this command, if a large number of abnormal log messages or duplicated log messages are output to the CLI, the device is generating a large number of log messages.

¡ Repeatedly execute the display logbuffer summary command. If the total log volume increases obviously, execute the display logbuffer reverse command to view detailed log information to identify whether a large number of abnormal log messages occur or whether a particular log message is repeatedly appearing in large quantities.

<Sysname> display logbuffer summary

Slot EMERG ALERT CRIT ERROR WARN NOTIF INFO DEBUG

1 0 0 2 9 24 12 128 0

5 0 0 0 41 72 8 2 0

97 0 0 42 11 14 7 40 0

<Sysname> display logbuffer reverse

Log buffer: Enabled

Max buffer size: 1024

Actual buffer size: 512

Dropped messages: 0

Overwritten messages: 0

Current messages: 410

%Jan 15 08:17:24:259 2021 Sysname SHELL/6/SHELL_CMD: -Line=vty0-IPAddr=192.168.2.108-User=**; Command is display logbuffer

%Jan 15 08:17:19:743 2021 Sysname SHELL/4/SHELL_CMD_MATCHFAIL: -User=**-IPAddr=192.168.2.108; Command display logfile in view shell failed to be matched.

...

If the device is generating a large number of log messages, use the following methods to reduce log generation:

¡ Disable the log output feature for some service modules.

¡ Execute the info-center logging suppress command to disable log output for a module.

¡ Execute the info-center logging suppress duplicates command to enable duplicate log suppression.

If the device has not generated a large number of log messages, proceed to step 6.

5. Collect CPU usage information, and identify the service modules where the CPU usage is high.

a. Identify the tasks that are consuming high CPU usage.

# Execute the display process cpu to view tasks that occupy most CPU resources within a period. This example displays information about slot 1.

<Sysname> display process cpu slot 1

CPU utilization in 5 secs: 0.4%; 1 min: 0.2%; 5 mins: 0.2%

JID 5Sec 1Min 5Min Name

1 0.0% 0.0% 0.0% scmd

2 5.5% 5.1% 5.0% [kthreadd]

3 0.0% 0.0% 0.0% [ksoftirqd/0]

...

If a process has a CPU usage higher than 3% (for reference only), further location is required for that process.

# Execute the monitor process dumbtty command to view the real-time CPU usage of a process. This example displays information about CPU 0 for slot 1.

<Sysname> system-view

[Sysname] monitor process dumbtty slot 1 cpu 0

206 processes; 342 threads; 5134 fds

Thread states: 4 running, 338 sleeping, 0 stopped, 0 zombie

CPU0: 99.04% idle, 0.00% user, 0.96% kernel, 0.00% interrupt, 0.00% steal

CPU1: 98.06% idle, 0.00% user, 1.94% kernel, 0.00% interrupt, 0.00% steal

CPU2: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

CPU3: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

CPU4: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

Memory: 7940M total, 5273M available, page size 4K

JID PID PRI State FDs MEM HH:MM:SS CPU Name

322 322 115 R 0 0K 01:48:03 20.02% [kdrvfwdd2]

323 323 115 R 0 0K 01:48:03 20.02% [kdrvfwdd3]

324 324 115 R 0 0K 01:48:03 20.02% [kdrvfwdd4]

376 376 120 S 22 159288K 00:00:07 0.37% diagd

1 1 120 S 18 30836K 00:00:02 0.18% scmd

379 379 120 S 22 173492K 00:00:11 0.18% devd

2 2 120 S 0 0K 00:00:00 0.00% [kthreadd]

3 3 120 S 0 0K 00:00:02 0.00% [ksoftirqd/0]

...

- In the output from the monitor process dumbtty command, find the JIDs of processes with CPU usage higher than 3% (for reference only). Then, execute the display process job command for these processes to collect detailed information about the processes, and identify whether the processes are running on the control core.

If the LAST_CPU field value in the output from the display process job command is the ID of the control core (for example, 0 and 1), a process is running on the CPU control core and further location is required. If the LAST_CPU field value is not the ID of the control core, a process is running on the CPU forwarding core, In this case, no action is required and proceed to step 7. Take the pppd process as an example. The output shows that this process contains multiple threads, all of which are running on the control core.

<Sysname> display process name pppd

Job ID: 515

PID: 515

Parent JID: 1

Parent PID: 1

Executable path: /sbin/pppd

Instance: 0

Respawn: ON

Respawn count: 1

Max. spawns per minute: 12

Last started: Wed Nov 3 09:52:00 2021

Process state: sleeping

Max. core: 1

ARGS: --MaxTotalLimit=2000000 --MaxIfLimit=65534 --CmdOption=0x01047fbf --bSaveRunDb --pppoechastenflag=1 --pppoechastennum=6 --pppoechastenperiod=60 --pppoechastenblocktime=300 --pppchastenflag=1 --pppchastennum=6 --pppchastenperiod=60 --pppchastenblocktime=300 --PppoeKChasten --bSoftRateLimit --RateLimitToken=2048

TID LAST_CPU Stack PRI State HH:MM:SS:MSEC Name

515 0 136K 115 S 0:0:0:90 pppd

549 0 136K 115 S 0:0:0:0 ppp_misc

557 0 136K 115 S 0:0:0:10 ppp_chasten

610 0 136K 115 S 0:0:0:0 ppp_work0

611 1 136K 115 S 0:0:0:0 ppp_work1

612 1 136K 115 S 0:0:0:0 ppp_work2

613 1 136K 115 S 0:0:0:0 mp_main

618 1 136K 115 S 0:0:0:110 pppoes_main

619 1 136K 115 S 0:0:0:100 pppoes_mesh

620 1 136K 115 S 0:0:0:120 l2tp_mesh

621 1 136K 115 S 0:0:0:20 l2tp_main

- For a process running on the control core with CPU usage higher than 5%, check the Name field value to identify whether the process is a user-mode process.

If the Name field for a process includes square brackets ([ ]), the process is a kernel thread, and you do not need to execute the monitor thread dumbtty command. If the Name field for a process does not include square brackets ([ ]), the process is a user process and might contain multiple threads. For user processes with multithreading, execute the monitor thread dumbtty command. If the LAST_CPU field of a thread in the output corresponds to the ID of the CPU control core, and the CPU field value is greater than 5%, this thread might cause high CPU core usage. Then, further location is required.

<Sysname> monitor thread dumbtty slot 1 cpu 0

206 processes; 342 threads; 5134 fds

Thread states: 4 running, 338 sleeping, 0 stopped, 0 zombie

CPU0: 98.06% idle, 0.97% user, 0.97% kernel, 0.00% interrupt, 0.00% steal

CPU1: 97.12% idle, 0.96% user, 0.96% kernel, 0.96% interrupt, 0.00% steal

CPU2: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

CPU3: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

CPU4: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

Memory: 7940M total, 5315M available, page size 4K

JID TID LAST_CPU PRI State HH:MM:SS MAX CPU Name

322 322 2 115 R 00:04:21 0 20.15% [kdrvfwdd2]

323 323 3 115 R 00:04:21 0 20.15% [kdrvfwdd3]

324 324 4 115 R 00:04:21 0 20.15% [kdrvfwdd4]

1 1 1 120 S 00:00:02 21 0.19% scmd

376 376 1 120 S 00:00:00 1 0.19% diagd

2 2 0 120 S 00:00:00 0 0.00% [kthreadd]

...

b. Identify the stacks of an abnormal task.

Execute the follow job command in probe view to identify the stacks of an abnormal task. The following takes the pppd process (with process ID 515) in slot 1 on the device as an example.

<Sysname> system-view

[Sysname] probe

[Sysname-probe] follow job 515 slot 1

Attaching to process 515 (pppd)

Iteration 1 of 5

------------------------------

Thread LWP 515:

Switches: 3205

User stack:

#0 0x00007fdc2a3aaa8c in epoll_wait+0x14/0x2e

#1 0x0000000000441745 in ppp_EpollSched+0x35/0x5c

#2 0x0000000000000004 in ??

Kernel stack:

[<ffffffff811f0573>] ep_poll+0x2f3/0x370

[<ffffffff811f06c0>] SyS_epoll_wait+0xd0/0xe0

[<ffffffff814aed79>] system_call_fastpath+0x16/0x1b

[<ffffffffffffffff>] 0xffffffffffffffff

Thread LWP 549:

Switches: 20

User stack:

#0 0x00007fdc2a3aaa8c in epoll_wait+0x14/0x2e

#1 0x00000000004435d4 in ppp_misc_EpollSched+0x44/0x6c

Kernel stack:

[<ffffffffffffffff>] 0xffffffffffffffff

...

c. Identify the task name based on steps a and b, and then find the corresponding service module according to the task name to locate and resolve issues in the service module. For example, if the CPU usage of the snmpd task is high, an SNMP attack might occur, or the NMS frequently accesses the device. Then, further troubleshooting is required for the SNMP service module. If the CPU usage of the nqad task is high, the NQA detection might be performed too frequently. Then, further troubleshooting is required for the NQA service module.

6. If the issue persists, collect the following information and contact Technical Support:

¡ Results of each step.

¡ The configuration files, log messages, and alarm messages.

Related alarm and log messages

Alarm messages

· hh3cEntityExtCpuUsageThresholdNotfication

· hh3cEntityExtCpuUsageThresholdRecover

· hh3cCpuUsageSevereNotification

· hh3cCpuUsageSevereRecoverNotification

· hh3cCpuUsageMinorNotification

· hh3cCpuUsageMinorRecoverNotification

Log messages

· DIAG/5/CPU_MINOR_RECOVERY

· DIAG/4/CPU_MINOR_THRESHOLD

· DIAG/5/CPU_SEVERE_RECOVERY

· DIAG/3/CPU_SEVERE_THRESHOLD

01-Fundamentals

Troubleshooting system management

Hardware resource management issues

High CPU usage

Symptom

Common causes

Troubleshooting flow

Solution

Related alarm and log messages

Alarm messages

Log messages

Cloud & AI

Intelligent Connection

Intelligent Computing

Intelligent Storage

Security

SMB Products

Intelligent Terminal Products

Industry Solutions

Product Support Services

Technical Service Solutions

Resource Center

Policy

Online Help

Technical Blogs

Training & Certification

Become a Partner

Partner Policy & Program

Global Learning

Partner Sales Resources

Partner Business Management

Service Business

Profile

News & Events

Online Exhibition Center

Contact Us