01-Fundamentals

HomeSupportRoutersSR6600-M SeriesDiagnose & MaintainTroubleshootingH3C SR6608-M Router Troubleshooting Guide(V9)-R9141-6W10001-Fundamentals
01-Hardware Resource Management Troubleshooting Guide

Troubleshooting system management

Hardware resource management issues

High CPU usage

Symptom

If one of the following conditions occurs, the CPU control core usage of the device is high, and you must identify the causes for the high CPU usage:

·     During daily inspection of the device, execute the display cpu-usage command repeatedly to view the CPU usage. The CPU usage is significantly higher than the daily average.

# Execute the display cpu-usage summary command to view the average CPU usage during the most recent 5-second, 1-minute, or 5-minute interval.

<Sysname> display cpu-usage summary

Slot CPU        Last 5 sec        Last 1 min        Last 5 min

1    0          5%                5              4%

# Execute the display cpu-usage history command to view the CPU usage in graphical form for the last 60 samples. The data shows that the CPU usage rate continues to increase or is significantly higher than the daily average value.

·     When you log in to the device via Telnet or SSH and execute commands, the device responds slowly and experiences stagnation.

·     The device outputs log messages about high CPU usage on the device.

·     Alarms on high CPU usage occur on the SNMP manager.

Common causes

The following are the common causes of this type of issue:

·     Network attacks.

·     Protocol flappings, typically route protocol flappings.

·     After flow sampling is configured on the device and the traffic to be processed is too large or the device's sampling frequency is too high, the sampling feature occupies a significant amount of CPU resources.

·     The device generates a large number of log messages. Then, abundant resources are occupied for the generation and management of these log messages.

Troubleshooting flow

Figure 1 shows the troubleshooting flowchart.

Figure 1 Flowchart for troubleshooting high CPU usage

Solution

1.     Identify whether a network attack occurs.

On a live network, the most common cause of high CPU usage is a network attack. Attackers initiate a large number of abnormal network interactions which hit the device. For example, the attackers transmit a large number request messages for setting up TCP connections or ICMP request messages in a short period. Then, the device is busy processing these attack messages, leading to high CPU usage and subsequently affecting the normal operation of the device.

In probe view, execute the display system internal control-plane statistics command to view statistics of the control plane to check the number of dropped messages. If the current CPU usage is high and the Dropped field value is large, a message attack occurs probably on the device.

<Sysname> display system internal control-plane statistics slot 1

Control plane slot 1

  Protocol: Default

    Bandwidth: 15360 (pps)

    Forwarded: 108926 (Packets), 29780155 (Bytes)

    Dropped  : 0 (Packets), 0 (Bytes)

  Protocol: ARP

    Bandwidth: 512 (pps)

    Forwarded: 1489284 (Packets), 55318920 (Bytes)

    Dropped  : 122114 (Packets), 491421 (Bytes)

...

¡     If a network attack occurs, first resolve the network attack issue.

¡     If no network attack occurs, proceed to step 2.

2.     Identify whether a protocol flapping occurs on the device.

A protocol flapping can cause continuous processing of protocol messages, topology calculations, and entry updates by the device, resulting in high CPU usage. In practical applications, the most common protocol flappings are OSPF protocol flappings.

¡     For OSPF flappings, execute the display ip routing-table command to view routing information. If route entries for the same network segment are frequently and repeatedly created and deleted in the routing table, a route flapping occurs.

-     If a route flapping occurs or the routes do not exist, troubleshoot link-related issues and IGP routing issues.

-     If no route flapping occurs, proceed to step 3.

3.     Identify whether flow statistics and sampling features are configured and whether the configured parameters are appropriate.

After network traffic monitoring features including NetStream and sFlow are configured on the device, the device will statistically analyze network traffic. If network traffic is high, the CPU usage might be high. In this case, perform the following operations:

¡     Configure filter conditions to precisely filter the traffic, and only analyze the traffic that users care about.

¡     Configure the sampler and adjust the sampling ratio. Then, the statistics collected by NetStream and sFlow can basically reflect the status of the entire network, and can prevent the excessive statistical messages from affecting the forwarding performance of the device.

4.     Identify whether the device is generating a large number of log messages.

In certain anomaly situations, for example, when the device is under attack, errors occur during the operation, or a port frequently runs up/down, the device continuously generates diagnostic information or log information. In this case, the system software needs to frequently read and write to the memory, which can increase the CPU usage.

Use the following methods to identify whether the device is generating a large number of log messages:

¡     Log in to the device via Telnet and execute the terminal monitor command to enable log output to the current terminal.

<Sysname> terminal monitor

The current terminal is enabled to display logs.

After you execute this command, if a large number of abnormal log messages or duplicated log messages are output to the CLI, the device is generating a large number of log messages.

¡     Repeatedly execute the display logbuffer summary command. If the total log volume increases obviously, execute the display logbuffer reverse command to view detailed log information to identify whether a large number of abnormal log messages occur or whether a particular log message is repeatedly appearing in large quantities.

<Sysname> display logbuffer summary

  Slot EMERG ALERT  CRIT ERROR  WARN NOTIF  INFO DEBUG

     1     0     0     2     9    24    12   128     0

     5     0     0     0    41    72     8     2     0

    97     0     0    42    11    14     7    40     0

<Sysname> display logbuffer reverse

Log buffer: Enabled

Max buffer size: 1024

Actual buffer size: 512

Dropped messages: 0

Overwritten messages: 0

Current messages: 410

%Jan 15 08:17:24:259 2021 Sysname SHELL/6/SHELL_CMD: -Line=vty0-IPAddr=192.168.2.108-User=**; Command is display logbuffer

%Jan 15 08:17:19:743 2021 Sysname SHELL/4/SHELL_CMD_MATCHFAIL: -User=**-IPAddr=192.168.2.108; Command display logfile in view shell failed to be matched.

...

If the device is generating a large number of log messages, use the following methods to reduce log generation:

¡     Disable the log output feature for some service modules.

¡     Execute the info-center logging suppress command to disable log output for a module.

¡     Execute the info-center logging suppress duplicates command to enable duplicate log suppression.

If the device has not generated a large number of log messages, proceed to step 6.

5.     Collect CPU usage information, and identify the service modules where the CPU usage is high.

a.     Identify the tasks that are consuming high CPU usage.

# Execute the display process cpu to view tasks that occupy most CPU resources within a period. This example displays information about slot 1.

<Sysname> display process cpu slot 1

CPU utilization in 5 secs: 0.4%; 1 min: 0.2%; 5 mins: 0.2%

    JID      5Sec      1Min      5Min    Name

      1      0.0%      0.0%      0.0%    scmd

      2      5.5%      5.1%      5.0%    [kthreadd]

      3      0.0%      0.0%      0.0%    [ksoftirqd/0]

...

If a process has a CPU usage higher than 3% (for reference only), further location is required for that process.

# Execute the monitor process dumbtty command to view the real-time CPU usage of a process. This example displays information about CPU 0 for slot 1.

<Sysname> system-view

[Sysname] monitor process dumbtty slot 1 cpu 0

206 processes; 342 threads; 5134 fds

Thread states: 4 running, 338 sleeping, 0 stopped, 0 zombie

CPU0: 99.04% idle, 0.00% user, 0.96% kernel, 0.00% interrupt, 0.00% steal

CPU1: 98.06% idle, 0.00% user, 1.94% kernel, 0.00% interrupt, 0.00% steal

CPU2: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

CPU3: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

CPU4: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

Memory: 7940M total, 5273M available, page size 4K

        JID        PID  PRI State  FDs     MEM  HH:MM:SS    CPU   Name

        322        322  115   R     0       0K  01:48:03  20.02%  [kdrvfwdd2]

        323        323  115   R     0       0K  01:48:03  20.02%  [kdrvfwdd3]

        324        324  115   R     0       0K  01:48:03  20.02%  [kdrvfwdd4]

        376        376  120   S    22  159288K  00:00:07   0.37%  diagd

          1          1  120   S    18   30836K  00:00:02   0.18%  scmd

        379        379  120   S    22  173492K  00:00:11   0.18%  devd

          2          2  120   S     0       0K  00:00:00   0.00%  [kthreadd]

          3          3  120   S     0       0K  00:00:02   0.00%  [ksoftirqd/0]

...

-     In the output from the monitor process dumbtty command, find the JIDs of processes with CPU usage higher than 3% (for reference only). Then, execute the display process job command for these processes to collect detailed information about the processes, and identify whether the processes are running on the control core.

If the LAST_CPU field value in the output from the display process job command is the ID of the control core (for example, 0 and 1), a process is running on the CPU control core and further location is required. If the LAST_CPU field value is not the ID of the control core, a process is running on the CPU forwarding core, In this case, no action is required and proceed to step 7. Take the pppd process as an example. The output shows that this process contains multiple threads, all of which are running on the control core.

   <Sysname> display process name pppd

                                Job ID: 515

                                   PID: 515

                            Parent JID: 1

                            Parent PID: 1

                       Executable path: /sbin/pppd

                              Instance: 0

                               Respawn: ON

                         Respawn count: 1

                Max. spawns per minute: 12

                          Last started: Wed Nov  3 09:52:00 2021

                         Process state: sleeping

                             Max. core: 1

                                  ARGS: --MaxTotalLimit=2000000 --MaxIfLimit=65534 --CmdOption=0x01047fbf --bSaveRunDb --pppoechastenflag=1 --pppoechastennum=6 --pppoechastenperiod=60 --pppoechastenblocktime=300 --pppchastenflag=1 --pppchastennum=6 --pppchastenperiod=60 --pppchastenblocktime=300 --PppoeKChasten --bSoftRateLimit --RateLimitToken=2048

       TID  LAST_CPU    Stack      PRI    State   HH:MM:SS:MSEC  Name

       515      0        136K      115      S     0:0:0:90       pppd

       549      0        136K      115      S     0:0:0:0        ppp_misc

       557      0        136K      115      S     0:0:0:10       ppp_chasten

       610      0        136K      115      S     0:0:0:0        ppp_work0

       611      1        136K      115      S     0:0:0:0        ppp_work1

       612      1        136K      115      S     0:0:0:0        ppp_work2

       613      1        136K      115      S     0:0:0:0        mp_main

       618      1        136K      115      S     0:0:0:110      pppoes_main

       619      1        136K      115      S     0:0:0:100      pppoes_mesh

       620      1        136K      115      S     0:0:0:120      l2tp_mesh

       621      1        136K      115      S     0:0:0:20       l2tp_main

-     For a process running on the control core with CPU usage higher than 5%, check the Name field value to identify whether the process is a user-mode process.

If the Name field for a process includes square brackets ([ ]), the process is a kernel thread, and you do not need to execute the monitor thread dumbtty command. If the Name field for a process does not include square brackets ([ ]), the process is a user process and might contain multiple threads. For user processes with multithreading, execute the monitor thread dumbtty command. If the LAST_CPU field of a thread in the output corresponds to the ID of the CPU control core, and the CPU field value is greater than 5%, this thread might cause high CPU core usage. Then, further location is required.

   <Sysname> monitor thread dumbtty slot 1 cpu 0

   206 processes; 342 threads; 5134 fds

   Thread states: 4 running, 338 sleeping, 0 stopped, 0 zombie

   CPU0: 98.06% idle, 0.97% user, 0.97% kernel, 0.00% interrupt, 0.00% steal

   CPU1: 97.12% idle, 0.96% user, 0.96% kernel, 0.96% interrupt, 0.00% steal

   CPU2: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

   CPU3: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

   CPU4: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

   Memory: 7940M total, 5315M available, page size 4K

           JID      TID  LAST_CPU  PRI  State  HH:MM:SS MAX   CPU     Name

           322       322      2      115    R    00:04:21    0  20.15%   [kdrvfwdd2]

           323       323      3      115    R    00:04:21    0  20.15%   [kdrvfwdd3]

           324       324      4      115    R    00:04:21    0  20.15%   [kdrvfwdd4]

             1         1     1     120    S   00:00:02   21   0.19%   scmd

           376       376     1     120    S   00:00:00    1   0.19%   diagd

             2         2     0     120    S   00:00:00    0   0.00%   [kthreadd]

...

b.     Identify the stacks of an abnormal task.

Execute the follow job command in probe view to identify the stacks of an abnormal task. The following takes the pppd process (with process ID 515) in slot 1 on the device as an example.

<Sysname> system-view

[Sysname] probe

[Sysname-probe] follow job 515 slot 1

Attaching to process 515 (pppd)

Iteration 1 of 5

------------------------------

Thread LWP 515:

Switches: 3205

User stack:

#0  0x00007fdc2a3aaa8c in epoll_wait+0x14/0x2e

#1  0x0000000000441745 in ppp_EpollSched+0x35/0x5c

#2  0x0000000000000004 in ??

Kernel stack:

[<ffffffff811f0573>] ep_poll+0x2f3/0x370

[<ffffffff811f06c0>] SyS_epoll_wait+0xd0/0xe0

[<ffffffff814aed79>] system_call_fastpath+0x16/0x1b

[<ffffffffffffffff>] 0xffffffffffffffff

Thread LWP 549:

Switches: 20

User stack:

#0  0x00007fdc2a3aaa8c in epoll_wait+0x14/0x2e

#1  0x00000000004435d4 in ppp_misc_EpollSched+0x44/0x6c

Kernel stack:

[<ffffffffffffffff>] 0xffffffffffffffff

...

c.     Identify the task name based on steps a and b, and then find the corresponding service module according to the task name to locate and resolve issues in the service module. For example, if the CPU usage of the snmpd task is high, an SNMP attack might occur, or the NMS frequently accesses the device. Then, further troubleshooting is required for the SNMP service module. If the CPU usage of the nqad task is high, the NQA detection might be performed too frequently. Then, further troubleshooting is required for the NQA service module.

6.     If the issue persists, collect the following information and contact Technical Support:

¡     Results of each step.

¡     The configuration files, log messages, and alarm messages.

Related alarm and log messages

Alarm messages

·     hh3cEntityExtCpuUsageThresholdNotfication

·     hh3cEntityExtCpuUsageThresholdRecover

·     hh3cCpuUsageSevereNotification

·     hh3cCpuUsageSevereRecoverNotification

·     hh3cCpuUsageMinorNotification

·     hh3cCpuUsageMinorRecoverNotification

Log messages

·     DIAG/5/CPU_MINOR_RECOVERY

·     DIAG/4/CPU_MINOR_THRESHOLD

·     DIAG/5/CPU_SEVERE_RECOVERY

·     DIAG/3/CPU_SEVERE_THRESHOLD

  • Cloud & AI
  • InterConnect
  • Intelligent Computing
  • Intelligent Storage
  • Security
  • SMB Products
  • Intelligent Terminal Products
  • Product Support Services
  • Technical Service Solutions
All Services
  • Resource Center
  • Policy
  • Online Help
  • Technical Blogs
All Support
  • Become A Partner
  • Partner Policy & Program
  • Global Learning
  • Partner Sales Resources
  • Partner Business Management
  • Service Business
All Partners
  • Profile
  • News & Events
  • Online Exhibition Center
  • Contact Us
All About Us
新华三官网