Excessive CPU Usage Technical Topic-6W100

  • Released At: 12-05-2025

Document version: 6W101-20240328

Copyright © 2023 New H3C Technologies Co., Ltd. All rights reserved

No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of New H3C Technologies Co., Ltd.

Except for the trademarks of New H3C Technologies Co., Ltd., any trademarks that may be mentioned in this document are the property of their respective owners.

The information in this document is subject to change without notice.



Introduction

Switches, as core network devices, are responsible for forwarding data packets and managing network traffic. Excessive CPU usage on a switch can lead to decreased network performance and faults, affecting the stability and reliability of communication.

This document provides a detailed explanation of the common causes, impacts, and troubleshooting methods for excessive CPU usage in switches.

Common causes of excessive CPU usage

The following are the common causes of excessive CPU usage:

·     Network congestion: When the switch processes a large volume of data traffic, the CPU might be overloaded, and the CPU usage is high.

·     Network attack: Malicious software or viruses might severely deplete CPU resources by attacking switches.

·     Protocol flapping: This typically refers to STP or routing protocol flapping, causing devices to frequently perform recalculation and updates.

·     Network loop: Constant traffic circulation caused by network loops requires continuous computation and processing by devices.

·     Flow sampling: After flow sampling is configured on the device, if the traffic to be processed is too heavy or the sampling frequency is too high, the sampling feature occupies a significant amount of CPU resources.

·     Excessive logging: The device generates a large number of log messages, and generating and managing these messages occupies abundant resources.

·     Misconfiguration: Incorrect settings, such as improper access control lists or configurations that allow broadcast storms, can cause a rise in CPU usage.

Impacts of excessive CPU usage

When the CPU usage is too high, the following impacts might occur:

·     Performance degradation: Excessive CPU usage can slow down device processing and lead to increased packet handling delays. This can lead to slower network response times and might negatively impact the user experience.

·     Packet loss and increased delay: Excessive CPU usage can affect a device's ability to process packets, potentially causing packet loss within the device or increased delay during forwarding.

·     Decreased system stability: Excessive CPU usage can increase the load on devices, potentially causing failures under heavy traffic and affecting network stability.

·     Service interruption: In extreme cases, if a switch's CPU usage is excessively high for an extended period, the switch might malfunction, leading to network service outages.

·     Security risk: Excessive CPU usage can increase device load, preventing normal detection and blocking of malicious traffic, thus posing a potential risk to network security.

·     System crash: The switch might crash and reboot when the CPU load exceeds its processing capacity.

Excessive CPU usage can affect the performance and stability of the switch. Therefore, timely monitoring and resolving excessive CPU usage issues are crucial for the network to run smoothly and to ensure a good user experience.


How to identify excessive CPU usage

If any of the following conditions occurs, the CPU control core usage of the device is high, and you must identify the cause of the excessive CPU usage:

·     During daily inspection of the device, execute the display cpu-usage command repeatedly to view the CPU usage. The CPU usage is higher than 60% or significantly higher than the daily average.

# Execute the display cpu-usage summary command to view the average CPU usage during the most recent 5-second, 1-minute, or 5-minute interval.

<Sysname> display cpu-usage summary

Slot CPU        Last 5 sec        Last 1 min        Last 5 min

1    0          5%                5%                4%

# Execute the display cpu-usage history command to view the CPU usage in graphical form for the last 60 samples. The data shows that the CPU usage continues to increase or is significantly higher than the daily average value.

·     The CLI responds slowly after you log in to the switch via Telnet or SSH and execute commands.

·     The device outputs log messages about excessive CPU usage on the device.

·     Alarms on excessive CPU usage occur on the SNMP manager.
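The daily-inspection check above can be scripted. The following is a minimal sketch that parses the text output of display cpu-usage summary and flags any slot whose 5-minute average exceeds the 60% guideline from this document; the parsing logic is an illustration, not an H3C tool, and assumes the column layout shown in the sample output.

```python
# Sketch: flag slots whose 5-minute average CPU usage exceeds a threshold,
# given the text output of "display cpu-usage summary". The 60% threshold
# comes from this document; the parser itself is illustrative only.

def high_cpu_slots(output: str, threshold: float = 60.0) -> list[str]:
    """Return 'slot/cpu' labels whose last-5-minute usage is above threshold."""
    flagged = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows look like: "1    0    5%    5%    4%"
        if len(fields) == 5 and fields[-1].endswith("%"):
            slot, cpu = fields[0], fields[1]
            last_5_min = float(fields[4].rstrip("%"))
            if last_5_min > threshold:
                flagged.append(f"slot {slot} cpu {cpu}")
    return flagged

sample = """Slot CPU        Last 5 sec        Last 1 min        Last 5 min
1    0          5%                5%                4%
2    0          85%               79%               71%"""

print(high_cpu_slots(sample))   # only slot 2 exceeds the 60% guideline
```

Run periodically against collected inspection output, this makes "significantly higher than the daily average" a concrete, repeatable check.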

View device model and software version information

1.     Execute the display device command in any view to view device information.

# Display device information. (Centralized IRF device.)

<Sysname> display device

Slot Type             State    Subslot  Soft Ver             Patch Ver

1    S6820-32H        Master   0        S6820-6103           None

# Display device information. (Distributed devices in standalone mode.)

<Sysname> display device

Slot Type             State    Subslot  Soft Ver             Patch Ver

0    LSXM1SUPB1       Master   0        S12508X-AF-0502      None

1    LSXM1SUPB1       Standby  0        S12508X-AF-0502      None

2    NONE             Absent   0        NONE                 None

3    LSXM1TGS48C2HB1  Normal   0        S12508X-AF-0502      None

4    NONE             Absent   0        NONE                 None

5    NONE             Absent   0        NONE                 None

6    NONE             Absent   0        NONE                 None

7    NONE             Absent   0        NONE                 None

8    NONE             Absent   0        NONE                 None

9    NONE             Absent   0        NONE                 None

10   NONE             Absent   0        NONE                 None

11   NONE             Absent   0        NONE                 None

12   NONE             Absent   0        NONE                 None

13   NONE             Absent   0        NONE                 None

14   NONE             Absent   0        NONE                 None

15   LSXM1SFH08D1     Normal   0        S12508X-AF-0502      None

2.     Execute the display version command in any view to view software version information.

<Sysname> display version

H3C Comware Software, Version 7.1.070, Feature 2607

Copyright (c) 2004-2017 New H3C Technologies Co., Ltd. All rights reserved.

H3C S6800-54QT uptime is 0 weeks, 0 days, 2 hours, 14 minutes

Last reboot reason : Cold reboot

 

Boot image: flash:/s6800-cmw710-boot-f2607.bin

Boot image version: 7.1.070, Feature 2607

  Compiled May 15 2017 16:00:00

System image: flash:/s6800-cmw710-system-f2607.bin

System image version: 7.1.070, Feature 2607

  Compiled May 15 2017 16:00:00

...

View CPU usage

Use the following methods to view CPU usage:

·     Execute the display cpu-usage command in any view. Under normal circumstances, the CPU usage of fixed-port switches, modular switches, and MPUs is below 60%. Determine whether the CPU usage of a switch or MPU is normal based on the average CPU usage over a 5-minute period. If the CPU usage is high, continue troubleshooting according to the following steps.

For example, the following command shows that the CPU usage of the card in slot 3 is abnormally high.

<H3C>display cpu

Slot 0 CPU usage:

       11% in last 5 seconds

       13% in last 1 minute

       13% in last 5 minutes

Slot 3 CPU usage:

       85% in last 5 seconds

       79% in last 1 minute

       71% in last 5 minutes

·     Execute the display logbuffer command to check for log messages on excessive CPU usage.

<Sysname> display logbuffer

Log buffer: Enabled

Max buffer size: 1024

Actual buffer size: 512

Dropped messages: 0

Overwritten messages: 718

Current messages: 512

%Jun 17 15:57:09:578 2019 Sysname SYSLOG/7/SYS_RESTART:System restarted –

...

The following are CPU usage-related log messages:

¡     DIAG/5/CPU_MINOR_RECOVERY

¡     DIAG/4/CPU_MINOR_THRESHOLD

¡     DIAG/5/CPU_SEVERE_RECOVERY

¡     DIAG/3/CPU_SEVERE_THRESHOLD

·     Alarms on excessive CPU usage occur on the SNMP manager.

If an SNMP manager is deployed, you can view alarms on excessive CPU usage on the SNMP manager. When the CPU usage exceeds the specified threshold, the system sends alarms. You can use the monitor cpu-usage threshold command to set CPU usage alarm thresholds.
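The threshold-and-recovery behavior described above can be sketched as a small state machine. The minor/severe levels mirror the DIAG log names listed in this document; the threshold values and the hysteresis rules below are illustrative assumptions, not H3C's implementation.

```python
# Sketch of threshold alarms: raise an event when CPU usage crosses a level,
# emit a recovery event when it drops back. Level names follow the DIAG log
# messages in this document; numeric thresholds are illustrative assumptions.

def cpu_alarm_events(samples, minor=70, severe=90):
    """Return (usage, event) pairs as usage crosses the alarm levels."""
    level = 0  # 0 = normal, 1 = minor alarm raised, 2 = severe alarm raised
    events = []
    for usage in samples:
        new_level = 2 if usage >= severe else 1 if usage >= minor else 0
        if new_level > level:
            events.append((usage, "CPU_SEVERE_THRESHOLD" if new_level == 2
                           else "CPU_MINOR_THRESHOLD"))
        elif new_level < level:
            events.append((usage, "CPU_SEVERE_RECOVERY" if level == 2
                           else "CPU_MINOR_RECOVERY"))
        level = new_level
    return events

print(cpu_alarm_events([50, 75, 95, 80, 40]))
```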

Collect CPU usage information, and identify the service modules where the CPU usage is high

Identify the processes that are consuming excessive CPU usage

Method 1: Execute the display process cpu command to identify processes that occupy the most CPU resources within a period. This example displays information about slot 1. If a process has a CPU usage higher than 3% (for reference only), further investigation of that process is required.

<Sysname> display process cpu slot 1

CPU utilization in 5 secs: 0.4%; 1 min: 0.2%; 5 mins: 0.2%

    JID      5Sec      1Min      5Min    Name

      1      0.0%      0.0%      0.0%    scmd

      2      5.5%      5.1%      5.0%    [kthreadd]

      3      0.0%      0.0%      0.0%    [ksoftirqd/0]

...
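The 3% reference check in Method 1 can be sketched as follows. The parser assumes the column layout of the sample output above and is illustrative only.

```python
# Sketch: extract the processes whose 5-minute CPU usage exceeds the 3%
# reference value from "display process cpu" output. Illustrative parser
# assuming the JID / 5Sec / 1Min / 5Min / Name layout shown above.

def busy_processes(output: str, threshold: float = 3.0):
    """Return (name, five_min_usage) for rows above the threshold."""
    result = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0].isdigit():
            five_min = float(fields[3].rstrip("%"))
            if five_min > threshold:
                result.append((fields[4], five_min))
    return result

sample = """    JID      5Sec      1Min      5Min    Name
      1      0.0%      0.0%      0.0%    scmd
      2      5.5%      5.1%      5.0%    [kthreadd]
      3      0.0%      0.0%      0.0%    [ksoftirqd/0]"""

print(busy_processes(sample))   # only [kthreadd] exceeds 3%
```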

Method 2: Execute the monitor process dumbtty command to view the real-time CPU usage of processes. This example displays information about CPU 0 for slot 1.

<Sysname> system-view

[Sysname] monitor process dumbtty slot 1 cpu 0

206 processes; 342 threads; 5134 fds

Thread states: 4 running, 338 sleeping, 0 stopped, 0 zombie

CPU0: 99.04% idle, 0.00% user, 0.96% kernel, 0.00% interrupt, 0.00% steal

CPU1: 98.06% idle, 0.00% user, 1.94% kernel, 0.00% interrupt, 0.00% steal

CPU2: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

CPU3: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

CPU4: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

Memory: 7940M total, 5273M available, page size 4K

        JID        PID  PRI State  FDs     MEM  HH:MM:SS    CPU   Name

        515        322  115   R     0       0K  01:48:03  20.02%  pppd

        376        376  120   S    22  159288K  00:00:07   0.37%  diagd

          1          1  120   S    18   30836K  00:00:02   0.18%  scmd

        379        379  120   S    22  173492K  00:00:11   0.18%  devd

          2          2  120   S     0       0K  00:00:00   0.00%  [kthreadd]

          3          3  120   S     0       0K  00:00:02   0.00%  [ksoftirqd/0]

...

·     In the output from the monitor process dumbtty command, find the JIDs of processes with CPU usage higher than 3% (for reference only). Then, execute the display process job command for these processes to collect detailed information about the processes, and identify whether the processes are running on the control core.

If the LAST_CPU field value in the output from the display process job command is the ID of the control core (for example, 0 or 1), the process is running on the CPU control core and requires further investigation. If the LAST_CPU field value is not the ID of the control core, the process is running on the CPU forwarding core. In this case, no action is required. Take the pppd process as an example. The output shows that this process contains multiple threads, all of which are running on the control core.

<Sysname> display process name pppd

                                Job ID: 515

                                   PID: 515

                            Parent JID: 1

                            Parent PID: 1

                       Executable path: /sbin/pppd

                              Instance: 0

                               Respawn: ON

                         Respawn count: 1

                Max. spawns per minute: 12

                          Last started: Wed Nov  3 09:52:00 2021

                         Process state: sleeping

                             Max. core: 1

                                  ARGS: --MaxTotalLimit=2000000 --MaxIfLimit=65534 --CmdOption=0x01047fbf --bSaveRunDb --pppoechastenflag=1 --pppoechastennum=6 --pppoechastenperiod=60 --pppoechastenblocktime=300 --pppchastenflag=1 --pppchastennum=6 --pppchastenperiod=60 --pppchastenblocktime=300 --PppoeKChasten --bSoftRateLimit --RateLimitToken=2048

       TID  LAST_CPU    Stack      PRI    State   HH:MM:SS:MSEC  Name

       515      0        136K      115      S     0:0:0:90       pppd

       549      0        136K      115      S     0:0:0:0        ppp_misc

       557      0        136K      115      S     0:0:0:10       ppp_chasten

       610      0        136K      115      S     0:0:0:0        ppp_work0

       611      1        136K      115      S     0:0:0:0        ppp_work1

       612      1        136K      115      S     0:0:0:0        ppp_work2

       613      1        136K      115      S     0:0:0:0        mp_main

       618      1        136K      115      S     0:0:0:110      pppoes_main

       619      1        136K      115      S     0:0:0:100      pppoes_mesh

       620      1        136K      115      S     0:0:0:120      l2tp_mesh

       621      1        136K      115      S     0:0:0:20       l2tp_main

·     For a process running on the control core with CPU usage higher than 5%, check the value for the Name field to identify whether the process is a user process.

If the Name field for a process includes square brackets ([ ]), the process is a kernel thread, and you do not need to execute the monitor thread dumbtty command. If the Name field does not include square brackets, the process is a user process and might contain multiple threads. For multithreaded user processes, execute the monitor thread dumbtty command. If the LAST_CPU field of a thread in the output corresponds to the ID of a CPU control core and the CPU field value is greater than 5%, this thread might cause excessive CPU control core usage and requires further investigation.

<Sysname> monitor thread dumbtty slot 1 cpu 0

   206 processes; 342 threads; 5134 fds

   Thread states: 4 running, 338 sleeping, 0 stopped, 0 zombie

   CPU0: 98.06% idle, 0.97% user, 0.97% kernel, 0.00% interrupt, 0.00% steal

   CPU1: 97.12% idle, 0.96% user, 0.96% kernel, 0.96% interrupt, 0.00% steal

   CPU2: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

   CPU3: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

   CPU4: 0.00% idle, 0.00% user, 100.00% kernel, 0.00% interrupt, 0.00% steal

   Memory: 7940M total, 5315M available, page size 4K

           JID       TID  LAST_CPU  PRI  State  HH:MM:SS  MAX   CPU     Name

           322       322      2     115    R    00:04:21    0  20.15%   [kdrvfwdd2]

           323       323      3     115    R    00:04:21    0  20.15%   [kdrvfwdd3]

           324       324      4     115    R    00:04:21    0  20.15%   [kdrvfwdd4]

             1         1      1     120    S    00:00:02   21   0.19%   scmd

           376       376      1     120    S    00:00:00    1   0.19%   diagd

             2         2      0     120    S    00:00:00    0   0.00%   [kthreadd]

...
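The two checks above (control-core placement via LAST_CPU, and the bracket naming convention for kernel threads) can be combined into one filter. The control-core IDs (0 and 1) and the 5% cutoff follow this document's examples; the helper itself is an illustrative sketch, not an H3C tool.

```python
# Sketch: a thread warrants further investigation only if it is a user
# thread (no square brackets in the name), its LAST_CPU is a control core,
# and its CPU usage exceeds the 5% reference value from this document.

def threads_to_investigate(rows, control_cores={0, 1}, cpu_cutoff=5.0):
    """rows: (name, last_cpu, cpu_pct). Return names needing investigation."""
    suspects = []
    for name, last_cpu, cpu_pct in rows:
        is_kernel = name.startswith("[") and name.endswith("]")
        if not is_kernel and last_cpu in control_cores and cpu_pct > cpu_cutoff:
            suspects.append(name)
    return suspects

rows = [
    ("[kdrvfwdd2]", 2, 20.15),   # kernel thread on a forwarding core
    ("scmd",        1,  0.19),   # control core, but usage is low
    ("pppd",        0, 20.02),   # user thread busy on a control core
]

print(threads_to_investigate(rows))   # only pppd needs further investigation
```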

View the stack information for an abnormal process

Execute the follow job command in probe view to view the abnormal process's stack. Query it more than five times and send the results to Technical Support for analysis to determine what processing is causing the continuous increase in CPU usage. The following example uses the pppd process (process ID 515) in slot 1 for illustration.

<Sysname> system-view

[Sysname] probe

[Sysname-probe] follow job 515 slot 1

Attaching to process 515 (pppd)

Iteration 1 of 5

------------------------------

Thread LWP 515:

Switches: 3205

User stack:

#0  0x00007fdc2a3aaa8c in epoll_wait+0x14/0x2e

#1  0x0000000000441745 in ppp_EpollSched+0x35/0x5c

#2  0x0000000000000004 in ??

Kernel stack:

[<ffffffff811f0573>] ep_poll+0x2f3/0x370

[<ffffffff811f06c0>] SyS_epoll_wait+0xd0/0xe0

[<ffffffff814aed79>] system_call_fastpath+0x16/0x1b

[<ffffffffffffffff>] 0xffffffffffffffff

Thread LWP 549:

Switches: 20

User stack:

#0  0x00007fdc2a3aaa8c in epoll_wait+0x14/0x2e

#1  0x00000000004435d4 in ppp_misc_EpollSched+0x44/0x6c

Kernel stack:

[<ffffffffffffffff>] 0xffffffffffffffff

...
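The reason for querying more than five times is that a busy thread tends to show the same innermost frame in repeated captures. As a rough illustration (not part of the H3C procedure), the innermost user-stack frame of each capture can be tallied; frame strings below are taken from the sample output above.

```python
# Sketch: tally the innermost user-stack frame ("#0" line) across repeated
# "follow job" captures. A frame that dominates the tally indicates where
# the thread keeps spending time. Illustrative helper only.

from collections import Counter

def top_frame_tally(captures):
    """captures: list of stack texts. Count the function in each '#0' frame."""
    tally = Counter()
    for text in captures:
        for line in text.splitlines():
            if line.startswith("#0"):
                # "#0  0x00007f... in epoll_wait+0x14/0x2e" -> "epoll_wait"
                func = line.split(" in ", 1)[1].split("+", 1)[0]
                tally[func] += 1
    return tally

captures = [
    "#0  0x00007fdc2a3aaa8c in epoll_wait+0x14/0x2e",
    "#0  0x00007fdc2a3aaa8c in epoll_wait+0x14/0x2e",
    "#0  0x0000000000441745 in ppp_EpollSched+0x35/0x5c",
]

print(top_frame_tally(captures))   # epoll_wait appears most often
```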

Resolve issues with the service module

Identify the corresponding service module according to the process name identified in "Identify the processes that are consuming excessive CPU usage" and "View the stack information for an abnormal process," and address the issues within that module. For example, if the snmpd process has excessive CPU usage, the device might be under an SNMP attack, or the NMS might be accessing the device too frequently. In this case, further troubleshoot the SNMP service module. If the CPU usage of the nqad process is high, it might be caused by frequent NQA operations. In this case, further troubleshoot the NQA service module.

Common processes

comsh

Definition

Comsh is a system process on H3C devices, known as the Command Shell Process. The comsh process interprets and executes commands, and manages users and their operations. After user login, the system automatically starts the comsh process, waiting for the user to enter commands. If the comsh process encounters an anomaly or error, it might lead to unstable or incorrect command line operations, affecting the device's management and maintenance.

The comsh process might have high usage due to the following reasons:

·     Improper operation: The comsh process responds to user commands. Mistakes during operation or freezes can easily lead to high comsh process usage.

·     High device load: When a device's load is too high, causing resource scarcity, the comsh process usage might become excessively high.

·     System software anomaly: High comsh process usage might result from anomalies in certain system software.

·     Too many user connections: When a large number of users log in to a device simultaneously, the load on the comsh process is heavy, leading to high comsh process usage.

Solution

·     Inspect user operations to make sure there are no stuck or incorrect operations.

·     Upgrade the hardware or reduce the use of some applications.

·     Reboot the device.

·     Contact Technical Support for further troubleshooting.

Ifmgr

Definition

The ifmgr process is a system process, known as the Interface Manager Process. The ifmgr process manages the various interfaces on a device, including physical, logical, and tunnel interfaces. It also monitors, configures, and manages the state of interfaces in real time. If the ifmgr process experiences an anomaly or error, it might cause issues with interface management and maintenance, affecting the device's normal operation and management.

Possible reasons for high ifmgr process usage include:

·     Interface configuration anomaly: Anomalies in device interface configurations, such as duplicate, incorrect, or improper settings, can cause the ifmgr process to malfunction and consume excessive system resources.

·     Hardware issue: Some hardware issues, such as damaged interface modules or unstable voltage, can lead to excessive ifmgr process usage.

·     System software error: Anomalies in the ifmgr process might arise from errors or vulnerabilities in a device's system software, potentially consuming excessive CPU or memory resources.

·     Extensive interface monitoring: Monitoring numerous interfaces within a device, such as multi-line load balancing, might result in the ifmgr process consuming significant system resources.

Solution

·     Inspect the interface configuration to ensure that there are no duplicate or incorrect settings.

·     Identify whether a hardware issue exists on the device.

·     Reboot the device.

·     Contact Technical Support for further troubleshooting.

nqad

Definition

The nqad process is a system process, known as the Network Quality Auto Detection process. The nqad process monitors the link quality of interfaces by periodically transmitting probe packets. It evaluates indicators such as connection latency, packet loss rate, and bandwidth usage, and sends the results back to the device's management platform. When a fault, such as packet loss, occurs, the nqad process automatically reduces the interface rate and sends an alarm to notify the operator. If the nqad process experiences an anomaly, it might result in link detection failures, leading to unstable link quality and affecting communication between nodes.

Possible reasons for high nqad process usage include:

·     Poor link quality: The nqad process detects link quality. If the link quality is poor, it might consume significant system resources.

·     Data collection interval too short: If the nqad process collects link data at intervals that are too short, the nqad process usage can be high.

·     Anomaly in the running environment: If the environment where the nqad process is running is unstable or experiences anomalies, it might cause the nqad process to malfunction, resulting in excessive CPU usage.

·     Insufficient device resources: If a device lacks resources, the nqad process usage can be high. To alleviate the load, consider upgrading the hardware or shutting down unnecessary applications.

Solution

·     Analyze the logs or network data to identify whether there is an anomaly with the link quality. If yes, it is advisable to further investigate and fix the related issues.

·     Adjust the data collection interval for the nqad process.

·     Reboot the device.

·     Contact Technical Support for further troubleshooting.

scmd

Definition

The scmd process is a system process, known as the System Command Process. The scmd process is responsible for processing and executing command requests from user terminals. If the scmd process encounters an anomaly, it might prevent users from executing normal commands, thus affecting the device's management and maintenance.

Possible reasons for high scmd process usage include:

·     Improper operation: The scmd process responds to user commands. Errors or freezes during operation can easily lead to high scmd process usage.

·     High device load: If a device's load is too high, causing resource scarcity, the scmd process usage might become excessively high. In such cases, consider upgrading the hardware or reducing the use of some applications.

·     System software anomaly: Anomalies in certain software within the system might cause the scmd process to have excessive usage.

Solution

·     Inspect user operations to make sure there are no stuck or incorrect operations.

·     Use the top command to check the current system load. If the load is too high, consider closing unnecessary applications to reduce it.

·     Reboot the device.

·     Contact Technical Support for further troubleshooting.

snmpd

Definition

The term "snmpd" refers to the Simple Network Management Protocol (SNMP) Daemon. In computer networks, SNMP is a standard protocol for managing and monitoring network devices. The snmpd process is responsible for listening to and responding to SNMP protocol requests, providing network device state information, performance data, and profiles for use by the network management system. Using the SNMP protocol, an administrator can remotely monitor and manage devices, detect faults, and collect performance data.

Possible reasons for high snmpd process usage include:

·     Excessive SNMP requests or SNMP attacks can occur because the SNMP protocol is widely applicable. If the snmpd process responds to multiple requests at a time, it can result in excessive CPU resource consumption.

·     High device load: If a device's load is too high, causing resource scarcity, the snmpd process usage might become excessively high. In such cases, consider upgrading the hardware or reducing the use of some applications.

·     SNMP configuration anomaly: Besides the basic common objects, the SNMP protocol requires configuration support for many device and application-specific objects. Improper SNMP configuration might lead to anomalies or errors in the snmpd process.

·     System software anomaly: Some system software might experience anomalies, resulting in excessive snmpd process usage.

Solution

·     Examine SNMP requests to validate there are no excessive or abnormal requests.

·     Close unnecessary applications to reduce load.

·     Identify whether the SNMP settings are correct.

·     Reboot the device.

·     Contact Technical Support for further troubleshooting.

sshd

Definition

sshd stands for Secure Shell Daemon. SSH is an encryption protocol used for secure remote connections, remote login, and file transmission. The sshd process is the SSH server daemon responsible for accepting and processing connection requests from SSH clients, providing services such as remote login and file transmission. Typically, the sshd process runs continuously in the background, waiting for user connection requests.

Possible reasons for high sshd process usage include:

·     Excessive concurrent connections: If a server receives numerous SSH connection requests, the sshd process might consume significant system resources while processing multiple connections simultaneously.

·     Excessive key configuration: Key authentication is a common method of identity verification in SSH. If a server has numerous key pairs or complex key authentication configurations, the sshd process might consume many CPU resources to resolve and verify keys.

·     Improper configuration: Parameters set in the sshd configuration file might cause high process usage. For example, configuring too many authentication methods or enabling advanced encryption algorithms might increase the load on the sshd process.

·     Malicious attack: If a server is targeted by malicious behaviors such as SSH brute force, password cracking, or DoS attacks, numerous attack attempts might exhaust sshd process resources, leading to increased usage.

Solution

·     Optimize server resources: Ensure that the server has sufficient CPU and memory to support a high number of concurrent connections and cryptographic key configuration.

·     Optimize SSH configuration: Review and adjust the parameter settings in the sshd_config file to avoid using too many authentication methods or enabling complex encryption algorithms.

·     Add access control settings: Restricting SSH access to trusted IP addresses or specific users can reduce the impact of invalid connection requests on the sshd process.

·     Strengthen security policy: Implement measures to prevent malicious attacks, such as enabling IP firewalls, using intelligent SSH anti-cracker tools, and regularly changing cryptographic keys.

 


Handle excessive CPU usage issues

Figure 1 shows the flowchart for troubleshooting excessive CPU usage:

Figure 1 Flowchart for troubleshooting excessive CPU usage

 

Identify whether the device is generating a large number of log messages

Under certain anomalies, such as when a device is under attack, encounters an error while running, or experiences frequent port up/down events, the device continuously generates diagnostic or log information. The system software then frequently reads and writes memory, increasing CPU usage.

Use the following methods to identify whether the device is generating a large number of log messages:

·     Log in to the device via Telnet and execute the terminal monitor command to enable log output to the current terminal.

<Sysname> terminal monitor

The current terminal is enabled to display logs.

After you execute this command, if a large number of abnormal log messages or duplicated log messages are output to the CLI, the device is generating a large number of log messages.

·     Repeatedly execute the display logbuffer summary command. If the total log volume increases obviously, execute the display logbuffer reverse command to view detailed log information to identify whether a large number of abnormal log messages occur or whether a particular log message is repeatedly appearing in large quantities.

<Sysname> display logbuffer summary

  Slot EMERG ALERT  CRIT ERROR  WARN NOTIF  INFO DEBUG

     1     0     0     2     9    24    12   128     0

     5     0     0     0    41    72     8     2     0

    97     0     0    42    11    14     7    40     0

<Sysname> display logbuffer reverse

Log buffer: Enabled

Max buffer size: 1024

Actual buffer size: 512

Dropped messages: 0

Overwritten messages: 0

Current messages: 410

%Jan 15 08:17:24:259 2021 Sysname SHELL/6/SHELL_CMD: -Line=vty0-IPAddr=192.168.2.108-User=**; Command is display logbuffer

%Jan 15 08:17:19:743 2021 Sysname SHELL/4/SHELL_CMD_MATCHFAIL: -User=**-IPAddr=192.168.2.108; Command display logfile in view shell failed to be matched.

...
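The "total log volume increases obviously" test can be made concrete by comparing two display logbuffer summary samples taken some time apart. The per-slot counts below come from the sample output above; the growth threshold is an illustrative assumption.

```python
# Sketch: compare two "display logbuffer summary" samples and report slots
# whose total log count grew by more than a threshold between samples.
# The threshold value is an illustrative assumption.

def log_growth(before: dict, after: dict, threshold: int = 100) -> list[int]:
    """Return slot numbers whose total log count grew by more than threshold."""
    noisy = []
    for slot, counts_after in after.items():
        growth = sum(counts_after) - sum(before.get(slot, []))
        if growth > threshold:
            noisy.append(slot)
    return noisy

# {slot: [EMERG, ALERT, CRIT, ERROR, WARN, NOTIF, INFO, DEBUG]}
before = {1: [0, 0, 2, 9, 24, 12, 128, 0], 5: [0, 0, 0, 41, 72, 8, 2, 0]}
after  = {1: [0, 0, 2, 9, 30, 12, 380, 0], 5: [0, 0, 0, 41, 75, 8, 2, 0]}

print(log_growth(before, after))   # slot 1 is generating logs rapidly
```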

If the device is generating a large number of log messages, use the following methods to reduce log generation:

·     Disable the log output feature for some service modules.

·     Execute the info-center logging suppress command to disable log output for a module.

·     Execute the info-center logging suppress duplicates command to enable duplicate log suppression.

Identify whether a network attack occurs

In live networks, the most common cause of excessive CPU usage is network attacks. Attackers initiate a surge of abnormal network interactions that overwhelm devices, such as transmitting numerous TCP connection establishment requests or ICMP request messages in a short period. This barrage of attack messages keeps the device busy processing, leading to excessive CPU usage and disrupting the normal operation of its services.

In probe view, execute the display system internal control-plane statistics command to view the statistics of control plane packets, paying attention to the number of discarded packets. If the current CPU usage is high and the Dropped field value is large, the device is likely under a packet attack. Support for the display system internal control-plane statistics command depends on the device model.

<Sysname> system-view

System View: return to User View with Ctrl+Z.

[Sysname] probe

[Sysname-probe] display system internal control-plane statistics slot 1

Control plane slot 1

  Protocol: Default

    Bandwidth: 15360 (pps)

    Forwarded: 108926 (Packets), 29780155 (Bytes)

    Dropped  : 0 (Packets), 0 (Bytes)

  Protocol: ARP

    Bandwidth: 512 (pps)

    Forwarded: 1489284 (Packets), 55318920 (Bytes)

    Dropped  : 122114 (Packets), 491421 (Bytes)

...
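Once the per-protocol statistics have been collected, the Dropped counts can be scanned mechanically. The figures below come from the sample output above; the cutoff for a "large" drop count is an illustrative assumption, since the document does not define one.

```python
# Sketch: flag protocols whose control-plane drop count is large, which
# (together with high CPU usage) suggests a packet attack on that protocol.
# The cutoff value is an illustrative assumption.

def suspect_protocols(stats, drop_cutoff=10000):
    """stats: {protocol: dropped_packets}. Return protocols over the cutoff."""
    return [proto for proto, dropped in stats.items() if dropped > drop_cutoff]

stats = {"Default": 0, "ARP": 122114}
print(suspect_protocols(stats))   # a large ARP drop count suggests an ARP attack
```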

If a network attack occurs, first resolve the network attack issue.

Identify whether a protocol flapping occurs on the device

Protocol flapping can cause the device to continuously process protocol messages, perform topology calculations, and update entries, resulting in excessive CPU usage. In practical applications, the most common types are STP flapping and OSPF flapping.

·     For STP flapping, execute the stp port-log command in system view to enable output of port state transition information. If the CLI of the device frequently outputs the following logs, STP flapping is occurring:

STP/6/STP_DETECTED_TC: Instance 0's port GigabitEthernet1/0/1 detected a topology change.

STP/6/STP_DISCARDING: Instance 0's port GigabitEthernet1/0/1 has been set to discarding state.

STP/6/STP_NOTIFIED_TC: Instance 0's port GigabitEthernet1/0/1 was notified a topology change.

¡     If an STP flapping occurs, first resolve the STP flapping issue.

¡     If no STP flapping occurs, proceed to the next step.

·     For OSPF route flapping, execute the display ip routing-table command to view routing information. If route entries for the same network segment are frequently created and deleted in the routing table, route flapping is occurring.

If route flapping occurs or expected routes are missing, troubleshoot link-related issues and IGP routing issues.
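Route flapping shows up as churn between successive routing-table snapshots. A minimal sketch, assuming you have reduced saved "display ip routing-table" outputs to sets of destination prefixes (the snapshot data below is hypothetical):

```python
# Illustrative sketch: prefixes that are both added and removed across
# successive routing-table snapshots hint at route flapping.
def route_churn(snapshots):
    """snapshots: list of sets of prefixes. Return prefixes that churned."""
    added, removed = set(), set()
    for old, new in zip(snapshots, snapshots[1:]):
        added |= new - old      # prefixes that appeared
        removed |= old - new    # prefixes that disappeared
    return added & removed      # churned in both directions

snaps = [
    {"10.1.1.0/24", "10.1.2.0/24"},
    {"10.1.1.0/24"},                 # 10.1.2.0/24 withdrawn
    {"10.1.1.0/24", "10.1.2.0/24"},  # re-advertised: likely flapping
]
print(route_churn(snaps))  # {'10.1.2.0/24'}
```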

Identify whether a network loop occurs

When an Ethernet interface operates in Layer 2 mode and a loop exists, a broadcast storm and a network flapping might occur. A large number of protocol packets sent to the CPU lead to an increase in CPU usage. When a network loop occurs, traffic significantly increases on many ports, with a large proportion of broadcast and multicast packets. To identify whether a device has a network loop or is experiencing a broadcast, multicast, or unknown unicast storm, perform the following steps:

1.     Clear the Ethernet interface traffic statistics.

<Sysname> reset counters interface

2.     Execute the display counters rate inbound interface command multiple times to identify whether the port usage has significantly increased.

<Sysname> display counters rate inbound interface

Usage: Bandwidth utilization in percentage

Interface               Usage(%)     Total(pps) Broadcast(pps) Multicast(pps)

GE5/3/0                     0.01              7            --             --

MGE0/31/0                   0.01              1            --             --

MGE0/32/0                   0.01              5            --             --

VMC1/1/0                    0.05             60            --             --

VMC1/2/0                    0.04             52            --             --

 

 Overflow: More than 14 digits.

       --: Not supported.

3.     If the port usage significantly increases, repeatedly execute the display counters inbound interface command to view the total number of packets received on each interface and the numbers of broadcast and multicast packets, shown in the Total(pkt), Broadcast(pkt), and Multicast(pkt) fields, respectively. If broadcast and multicast packets account for a high proportion of the received packets, a broadcast or multicast storm might be occurring. If the total packet count increases significantly while the broadcast and multicast counts do not, an unknown-unicast storm might be occurring.

<Sysname> display counters inbound interface

Interface                            Total(pkt) Broadcast(pkt) Multicast(pkt) Err(pkt)

GE5/3/0                                 141             27            111        0

MGE0/31/0                           274866          47696              0       --

MGE0/32/0                          1063034         684808              2       --

VMC1/1/0                          11157797        7274558             50        0

VMC1/2/0                           9653898        5619640             52        0

 

 Overflow: More than 14 digits (7 digits for column "Err").

       --: Not supported.
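The proportion test in the previous step can be expressed as simple arithmetic over two readings of the counters. A minimal sketch, assuming the field values have been extracted into dicts; the rate and share thresholds are illustrative assumptions, not device defaults:

```python
# Illustrative sketch: classify a possible storm from two readings of
# "display counters inbound interface".
# rate_threshold and bcast_mcast_share are example values, not device defaults.
def classify_storm(prev, curr, rate_threshold=100000, bcast_mcast_share=0.7):
    """prev/curr: dicts with Total, Broadcast, Multicast packet counts."""
    d_total = curr["Total"] - prev["Total"]
    d_bm = (curr["Broadcast"] - prev["Broadcast"]) + \
           (curr["Multicast"] - prev["Multicast"])
    if d_total < rate_threshold:
        return "no storm suspected"
    if d_bm / d_total >= bcast_mcast_share:
        return "broadcast/multicast storm suspected"
    return "unknown-unicast storm suspected"

prev = {"Total": 1000, "Broadcast": 100, "Multicast": 10}
curr = {"Total": 11001000, "Broadcast": 8000100, "Multicast": 500010}
print(classify_storm(prev, curr))  # broadcast/multicast storm suspected
```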

If a link loop occurs, perform the following operations:

·     Troubleshoot the link connection to prevent the occurrence of loops in the physical topology.

·     Execute the display stp command to identify whether STP is enabled and correct the configuration.

·     Execute the display stp brief and display stp abnormal-port commands to check the spanning tree status on neighboring devices. Locate and resolve STP anomalies according to the BlockReason field value in the output from the display stp abnormal-port command.

·     If the STP configuration is correct, STP might have miscalculated, or the calculation is correct but the port was not blocked at the driver layer as expected. To quickly restore STP and eliminate the loop, execute the shutdown and undo shutdown commands, or unplug and re-plug the network cable, on the interface where the loop occurs to trigger STP recalculation.

·     In Ethernet interface view, execute broadcast-suppression to enable broadcast suppression on an interface, execute multicast-suppression to enable multicast storm suppression, and execute unicast-suppression to enable unknown unicast storm suppression. Alternatively, execute the flow-control command to configure flow control. Support for the broadcast-suppression, multicast-suppression, unicast-suppression, and flow-control commands depends on the device model.

·     Apply QoS policies for rate limiting on multicast, broadcast and unknown unicast packets.

Identify whether flow statistics and sampling features are configured and whether the configured parameters are appropriate

When network traffic monitoring features such as NetStream or sFlow are configured, the device statistically analyzes network traffic. If the traffic volume is high, this analysis can consume substantial CPU resources. In this case, perform the following operations:

·     Configure filter conditions to precisely filter the traffic, and only analyze the traffic that users care about.

·     Configure a sampler and adjust the sampling ratio to ensure that NetStream and sFlow collect statistics that accurately reflect the overall network status while preventing an excessive number of statistical packets from impacting device forwarding performance.
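The effect of the sampling ratio on CPU load is simple arithmetic: with 1-in-N sampling, the number of packets the device must process for statistics shrinks by a factor of N. An illustrative calculation (the traffic figures and sampling rates below are hypothetical examples):

```python
# Illustrative arithmetic: the statistics workload of NetStream/sFlow scales
# with the number of sampled packets, so the sampling rate should keep the
# sampled packet rate within what the CPU can comfortably absorb.
def sampled_pps(traffic_pps, sampling_rate):
    """With 1-in-N sampling, roughly traffic_pps / N packets are processed per second."""
    return traffic_pps / sampling_rate

# 2 Mpps of traffic with 1-in-1000 sampling leaves 2000 pps of sampled packets;
# raising the rate to 1-in-8192 cuts that to roughly 244 pps.
print(sampled_pps(2_000_000, 1000))          # 2000.0
print(round(sampled_pps(2_000_000, 8192)))   # 244
```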

If the issue persists, collect the following information and contact Technical Support:

·     Results of each step.

·     The configuration file, log messages, and alarm messages.

 


Excessive CPU usage cases

Excessive CPU usage caused by a Layer 2 loop in the network

Symptom

Latency occurs when you log in to the device. The device's CPU usage has significantly increased, and its ports are experiencing anomalous fluctuations, with some ports receiving the same packets repeatedly in a short period.

Root cause

If a port repeatedly receives the same packets within a short time, the likely cause is a Layer 2 loop. A Layer 2 loop forms when more than one active Layer 2 path exists between two devices and no mechanism blocks the redundant path, leading to a broadcast storm and endless traffic circulation.

Location

1.     Execute the display cpu-usage summary command to view current CPU usage statistics. The CPU usage is as high as 88%.

<Sysname> display cpu-usage summary

Slot CPU        Last 5 sec        Last 1 min        Last 5 min

1    0          88%                83%                81%

2.     Clear the Ethernet interface traffic statistics.

<Sysname> reset counters interface

3.     Execute the display counters rate inbound interface command multiple times to confirm that the port usage has increased significantly.

# Execute the display counters rate inbound interface command for the first time.

<Sysname> display counters rate inbound interface

Usage: Bandwidth utilization in percentage

Interface               Usage(%)     Total(pps) Broadcast(pps) Multicast(pps)

GE1/0/1                  10.01       4527863793  2677938345     2683457793

GE1/0/2                  9.23        5727856379  2476385793     2876453793

 Overflow: More than 14 digits.

       --: Not supported.

# Execute the display counters rate inbound interface command for the second time.

<Sysname> display counters rate inbound interface

Usage: Bandwidth utilization in percentage

Interface               Usage(%)     Total(pps) Broadcast(pps) Multicast(pps)

GE1/0/1                  78.12       Overflow     Overflow      Overflow

GE1/0/2                  80.81       Overflow     Overflow      Overflow

 Overflow: More than 14 digits.

       --: Not supported.

4.     Execute the display mac-address mac-move command to view MAC address move records.

<Sysname> display mac-address mac-move

MAC address    VLAN Current port  Source port   Last time           Times

0000-0001-002c 1    GE1/0/1       GE1/0/2       2013-05-20 13:40:52 20

0000-0001-002c 1    GE1/0/2       GE1/0/1       2013-05-20 13:41:32 20

0000-0001-003c 1    GE1/0/1       GE1/0/2       2013-05-20 13:40:52 20

0000-0001-003c 1    GE1/0/2       GE1/0/1       2013-05-20 13:41:32 20

0000-0001-004c 1    GE1/0/1       GE1/0/2       2013-05-20 13:40:52 20

0000-0001-004c 1    GE1/0/2       GE1/0/1       2013-05-20 13:41:32 20

0100-0001-005c 1    GE1/0/1       GE1/0/2       2013-05-20 13:40:52 20

0100-0001-005c 1    GE1/0/2       GE1/0/1       2013-05-20 13:41:32 20

0100-0001-006c 1    GE1/0/1       GE1/0/2       2013-05-20 13:40:52 20

0100-0001-006c 1    GE1/0/2       GE1/0/1       2013-05-20 13:41:32 20

0100-0001-007c 1    GE1/0/1       GE1/0/2       2013-05-20 13:40:52 20

0100-0001-007c 1    GE1/0/2       GE1/0/1       2013-05-20 13:41:32 20

---  121 MAC address moving records found  ---

The output shows a large number of MAC address moves between the same ports, indicating a Layer 2 loop in the network.
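Many MACs bouncing between the same pair of ports is the classic loop signature, and it can be tallied mechanically from saved command output. An illustrative sketch (not part of the device software):

```python
# Illustrative sketch: count MAC move records per port pair from saved
# "display mac-address mac-move" output; many records concentrated on one
# port pair points at the looped links.
from collections import Counter
import re

def mac_move_hotspots(output):
    """Return a Counter mapping (port_a, port_b) -> number of move records."""
    pairs = Counter()
    # MAC, VLAN ID, current port, source port, then a date starting the timestamp
    pat = re.compile(r"^\S+\s+\d+\s+(\S+)\s+(\S+)\s+\d{4}-")
    for line in output.splitlines():
        m = pat.match(line.strip())
        if m:
            pairs[tuple(sorted(m.groups()))] += 1
    return pairs

sample = """
0000-0001-002c 1    GE1/0/1       GE1/0/2       2013-05-20 13:40:52 20
0000-0001-002c 1    GE1/0/2       GE1/0/1       2013-05-20 13:41:32 20
0000-0001-003c 1    GE1/0/1       GE1/0/2       2013-05-20 13:40:52 20
"""
print(mac_move_hotspots(sample))  # Counter({('GE1/0/1', 'GE1/0/2'): 3})
```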

Solution

·     Troubleshoot the link connection to prevent the occurrence of loops in the physical topology.

·     Execute the display stp command to identify whether STP is enabled and correct the configuration.

·     Execute the display stp brief and display stp abnormal-port commands to check the spanning tree status on neighboring devices. Locate and resolve STP anomalies according to the BlockReason field value in the output from the display stp abnormal-port command.

·     If the STP configuration is correct, STP might have miscalculated, or the calculation is correct but the port was not blocked at the driver layer as expected. To quickly restore STP and eliminate the loop, execute the shutdown and undo shutdown commands, or unplug and re-plug the network cable, on the interface where the loop occurs to trigger STP recalculation.

·     In Ethernet interface view, execute broadcast-suppression to enable broadcast suppression on an interface, execute multicast-suppression to enable multicast storm suppression, and execute unicast-suppression to enable unknown unicast storm suppression. Alternatively, execute the flow-control command to configure flow control. Support for the broadcast-suppression, multicast-suppression, unicast-suppression, and flow-control commands depends on the device model.

·     Apply QoS policies for rate limiting on multicast, broadcast and unknown unicast packets.

Excessive CPU usage caused by an ARP attack

Symptom

The CPU usage of a core switch has risen above 95%, disrupting communication for all connected devices and causing network outages and service interruptions.

Root cause

An attacker injected a large number of ARP messages into the network, raising the switch's CPU load so high that it could not process normal network traffic.

Location

1.     Execute the display cpu-usage summary command to view current CPU usage statistics. The CPU usage is as high as 98%.

<Sysname> display cpu-usage summary

Slot CPU        Last 5 sec        Last 1 min        Last 5 min

1    0          98%                93%                91%

2.     Execute the display system internal control-plane statistics command in probe view to view the packet statistics for the control plane.

<Sysname> system-view

System View: return to User View with Ctrl+Z.

[Sysname] probe

[Sysname-probe] display system internal control-plane statistics slot 1

Control plane slot 1

  Protocol: Default

    Bandwidth: 15360 (pps)

    Forwarded: 108926 (Packets), 29780155 (Bytes)

    Dropped  : 0 (Packets), 0 (Bytes)

  Protocol: ARP

    Bandwidth: 512 (pps)

    Forwarded: 1489284 (Packets), 55318920 (Bytes)

    Dropped  : 1221124 (Packets), 4491421 (Bytes)

The output shows that the control plane dropped a large number of ARP packets.

3.     Execute the debug rxtx softcar show command in probe view.

[Sysname-probe] debug rxtx softcar show slot 1

 

ID  Type                RcvPps Rcv_All    DisPkt_All Pps  Dyn Swi Hash Am APps 

0   ROOT                0      0          0          200  S   On  SMAC 0 0     

1   ISIS                0      0          0          200  D   On  SMAC 8 512   

2   ESIS                0      0          0          100  S   On  SMAC 8 512   

31  ARP                 143    183008857  59300      750  S   On  SMAC 8 -     

The output shows that a large number of ARP packets were sent to the CPU.

The information above indicates that the device is currently under an ARP attack.

Solution

1.     Use the debugging arp packet command to enable ARP packet debugging.

<Sysname> debugging arp packet

*May 14 18:14:36:453 2023 S105-IRF ARP/7/ARP_RCV: -MDC=1-Chassis=1-Slot=7; Received an ARP message, operation: 1, sender MAC: 0024-7e04-578d, sender IP: 192.168.50.30, target MAC: 0000-0000-0000, target IP: 192.168.50.1

*May 14 18:14:36:453 2023 S105-IRF ARP/7/ARP_RCV: -MDC=1-Chassis=1-Slot=7; Received an ARP message, operation: 1, sender MAC: 0024-7e04-578d, sender IP: 192.168.50.30, target MAC: 0000-0000-0000, target IP: 192.168.50.2

*May 14 18:14:36:453 2023 S105-IRF ARP/7/ARP_RCV: -MDC=1-Chassis=1-Slot=7; Received an ARP message, operation: 1, sender MAC: 0024-7e04-578d, sender IP: 192.168.50.30, target MAC: 0000-0000-0000, target IP: 192.168.50.3

*May 14 18:14:36:453 2023 S105-IRF ARP/7/ARP_RCV: -MDC=1-Chassis=1-Slot=7; Received an ARP message, operation: 1, sender MAC: 0024-7e04-578d, sender IP: 192.168.50.30, target MAC: 0000-0000-0000, target IP: 192.168.50.4

The output shows that the device is under an ARP attack with the source MAC address 0024-7e04-578d.
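When the debugging output is long, tallying the sender MACs makes the dominant source stand out. An illustrative sketch that operates on saved debugging output (not part of the device software):

```python
# Illustrative sketch: tally sender MACs in saved "debugging arp packet"
# output to identify the dominant source of ARP requests.
from collections import Counter
import re

def top_arp_senders(debug_output, n=3):
    """Return the n most frequent sender MACs as (mac, count) pairs."""
    macs = re.findall(
        r"sender MAC:\s*([0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4})", debug_output)
    return Counter(macs).most_common(n)

sample = """
Received an ARP message, operation: 1, sender MAC: 0024-7e04-578d, sender IP: 192.168.50.30
Received an ARP message, operation: 1, sender MAC: 0024-7e04-578d, sender IP: 192.168.50.30
Received an ARP message, operation: 1, sender MAC: 00aa-bb04-0001, sender IP: 192.168.50.9
"""
print(top_arp_senders(sample))  # [('0024-7e04-578d', 2), ('00aa-bb04-0001', 1)]
```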

2.     Enable source MAC-based ARP packet rate limit and set the limit rate.

<Sysname> system-view

System View: return to User View with Ctrl+Z.

[Sysname] arp rate-limit source-mac 0024-7e04-578d 50

3.     Enable the source MAC-based ARP attack detection feature and specify the filter handling method.

[Sysname] arp source-mac filter

 


How to avoid excessive CPU usage

·     Enable loop detection and STP.

·     Configure attack prevention features: The device offers a variety of attack defense technologies to detect and resolve attacks in a LAN. For information about attack prevention features, see the security configuration guide of the device.

·     Regularly update the operating system and firmware versions: Doing so helps prevent security holes and performance issues, ensuring the device runs smoothly and securely.

·     Set up port mirroring as needed: Port mirroring helps monitor and detect network traffic, allowing for the early identification of anomalies and threats within the network.

·     Set up reasonable port security policies: Implementing sensible port security measures can limit malicious traffic in the network and safeguard the security of devices and the network.

·     Enable the defense against high-traffic attacks: High-traffic attacks are a common type of network attack, and activating this defense can effectively prevent devices from being overwhelmed by excessive CPU resource usage.

 


Related alarm and log messages

Alarm messages

·     hh3cEntityExtCpuUsageThresholdNotfication

·     hh3cEntityExtCpuUsageThresholdRecover

·     hh3cCpuUsageSevereNotification

·     hh3cCpuUsageSevereRecoverNotification

·     hh3cCpuUsageMinorNotification

·     hh3cCpuUsageMinorRecoverNotification

Log messages

·     DIAG/5/CPU_MINOR_RECOVERY

·     DIAG/4/CPU_MINOR_THRESHOLD

·     DIAG/5/CPU_SEVERE_RECOVERY

·     DIAG/3/CPU_SEVERE_THRESHOLD
