Configuring AIOps
AIOps overview
AIOps stands for artificial intelligence for IT operations. ICT devices use AI technology to improve operating efficiency and enhance traditional maintenance methods, which helps users reduce costs. The three core elements of AI are algorithms, computing power, and data. A device applies AI algorithms to large amounts of sample data and uses the computing power of its chips to implement different AI functions.
AIOps functions
AI ECN
AI for Explicit Congestion Notification (AI ECN) uses AI algorithms and data models to dynamically predict the optimal ECN threshold for the queue. Devices transmit packets with ECN markers based on the optimized threshold to reduce network congestion. This ensures low latency and high throughput in complex network environments. For more information about AI ECN, see "Configuring AI ECN."
AI for log aggregation and root cause analysis
This function aggregates all the log information received by the information center within a certain period of time based on the correlation between the logs. Based on the log aggregation results and the root cause analysis library, it infers the possible causes of faults and outputs a root cause analysis file for users to reference. For more information about this function, see "Configuring AI for log aggregation and root cause analysis."
AI for device anomaly detection
This function uses algorithms for anomaly detection on CPU usage or usage of other resources. It outputs alarm messages based on the detected anomaly. Compared to manually configured usage alarm thresholds, this function can detect anomalies more accurately. For more information about this function, see "Configuring AI for device anomaly detection."
AI models
AI functions often require training on extensive laboratory data to generate a reliable algorithm model. You can copy the algorithm model file directly to the device after downloading it from the specified official website, or use the analyzer to push it to the device through Telemetry.
After you load a model file to the device, the enabled AI services automatically read the model file based on its filename. They then infer the optimal configuration values by using the algorithm model file, thereby implementing the AI functionalities.
License requirements for AI model file loading
Loading AI model files requires licenses. To use this feature, first install the corresponding licenses. For more information, see license management in Fundamentals Configuration Guide.
Managing AI model files
Restrictions and guidelines
To ensure that different AI functions can correctly identify their respective model files, the filenames obtained from the official website or analyzer follow certain naming conventions. As a best practice, do not manually change the filenames. A renamed model file might not be recognized by the AI function.
Take a model file named h3caiecn-1-dqn-001 as an example:
· The aiecn string represents the process name associated with the AI function.
· A value of 1 indicates that the model file is generated through analyzer cloud pushing, while a value of 0 means that the file is manually downloaded from the official website.
· The dqn string denotes the name of the algorithm and other descriptions used by the AI function. An AI function can be implemented based on various AI algorithms.
· The 001 string indicates the version number.
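The naming convention above can be checked with a short script before a file is copied to a device. The following is a minimal sketch, not a vendor tool. It assumes that the fields are separated by hyphens, that the leading h3c string is a fixed vendor prefix, and that the algorithm field might itself contain hyphens. The parse_model_filename function and the ModelFileName type are hypothetical names used for illustration only.

```python
from typing import NamedTuple


class ModelFileName(NamedTuple):
    process: str    # process name of the AI function, for example "aiecn"
    pushed: bool    # True: pushed by the analyzer, False: downloaded manually
    algorithm: str  # algorithm name and other descriptions, for example "dqn"
    version: str    # version string, for example "001"


def parse_model_filename(name: str, vendor_prefix: str = "h3c") -> ModelFileName:
    """Split a model filename into the four parts described above.

    Assumption: fields are hyphen-separated and the first field starts
    with a fixed vendor prefix. Neither is stated as a formal rule.
    """
    parts = name.split("-")
    if len(parts) < 4:
        raise ValueError(f"unexpected model filename: {name!r}")
    head, source, *middle, version = parts
    if not head.startswith(vendor_prefix) or source not in ("0", "1"):
        raise ValueError(f"unexpected model filename: {name!r}")
    return ModelFileName(
        process=head[len(vendor_prefix):],
        pushed=(source == "1"),
        algorithm="-".join(middle),
        version=version,
    )


info = parse_model_filename("h3caiecn-1-dqn-001")
print(info.process, info.pushed, info.algorithm, info.version)
```

A filename that fails this check has probably been renamed and might not be recognized by the AI function.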
If the currently loaded AI model file is not applicable to the device's live environment, you can unload the model file by using a command.
Procedure
1. Enter system view.
system-view
2. Enter AI service view.
ai-service
3. Load an AI model file.
model load file-name
By default, no AI model files exist.
To load an AI model file, first save the AI model file in the device's storage space.
4. (Optional.) Unload an AI model file.
model unload file-name
Display and maintenance commands for AI models
Execute display commands in any view.
Task | Command |
---|---|
Display information about the loaded AI model files. | display ai-service model |
Configuring AI ECN
About AI ECN
AI ECN is a dynamic congestion notification technology implemented by using AI algorithms. AI ECN can be used in an intelligent lossless network to provide congestion avoidance for RDMA over Converged Ethernet version 2 (RoCEv2) traffic.
Basic concepts
ECN uses the DS field in the IP packet header to mark the congestion status on the transmission path. An ECN-capable endpoint can determine congestion on the transmission path through ECN marks in packets, and adjust transmission rates to prevent worsening congestion.
RFC 2481 defines the last two bits in the DS field of the IP packet header as the ECN field:
· Bit 6 is the ECN-Capable Transport (ECT) bit. It identifies whether the sending device supports ECN.
· Bit 7 is the Congestion Experienced (CE) bit. It identifies whether a packet has encountered congestion on its transmission path.
Figure 1 ECN field in IPv4 packet header
As shown in Figure 1, RFC 3168 defines the values of the ECN field for an IPv4 packet as follows:
· A value of 00 indicates that the sending device does not support ECN.
· A value of 01 or 10 indicates that the sending device supports ECN. The value 01 is marked as ECT(1), and the value 10 is marked as ECT(0).
· A value of 11 indicates that congestion has occurred on the forwarding path of the packet. This value is marked as CE.
RFC 3168 defines the last two bits of the Traffic Class field in the IPv6 header as the ECN field.
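The ECN field can be read from a captured packet by masking the low-order two bits of the IPv4 DS (former ToS) byte or the IPv6 Traffic Class byte. The following minimal sketch decodes the four RFC 3168 codepoints (00 Not-ECT, 01 ECT(1), 10 ECT(0), 11 CE) and shows the marking rule a forwarding device applies. The function names are illustrative and are not part of any device API.

```python
# RFC 3168 codepoints of the two-bit ECN field.
ECN_CODEPOINTS = {
    0b00: "Not-ECT",  # sender does not support ECN
    0b01: "ECT(1)",   # sender supports ECN
    0b10: "ECT(0)",   # sender supports ECN
    0b11: "CE",       # congestion experienced
}


def ecn_codepoint(ds_byte: int) -> str:
    """Return the ECN codepoint name for a DS/Traffic Class byte."""
    if not 0 <= ds_byte <= 0xFF:
        raise ValueError("DS byte must be in the range 0 to 255")
    return ECN_CODEPOINTS[ds_byte & 0b11]


def mark_ce(ds_byte: int) -> int:
    """Set CE on an ECN-capable packet; leave Not-ECT packets unchanged."""
    if ds_byte & 0b11 == 0b00:
        return ds_byte
    return ds_byte | 0b11


print(ecn_codepoint(0x02))        # ECT(0)
print(ecn_codepoint(mark_ce(0x02)))  # CE
```

The upper six bits (the DSCP value) are not changed by the marking operation.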
Static ECN
Definition of static ECN
Static ECN refers to ECN that works with WRED. You manually configure WRED parameters for a queue (including the upper threshold and lower threshold for the average queue length) and then enable ECN for the queue. For more information about static ECN, see QoS in ACL and QoS Configuration Guide.
Advantages and disadvantages of static ECN
Static ECN has the following advantages:
· Setting a proper lower threshold allows devices to detect congestion on the path in advance and have the receiving end notify the transmitting end to slow down the transmission rate.
· The forwarding device sets the ECN field to 11 for packets that exceed the lower threshold. This avoids packet drops and retransmissions in the network and reduces network delay.
· When congestion occurs in the network, the sender gradually lowers the packet transmission rate within a certain time. After congestion disappears, the sender gradually increases the rate to avoid rapid changes of network throughput before and after congestion.
However, traffic passing through each queue might dynamically change over time. Network administrators cannot adapt to real-time traffic changes through static ECN threshold settings.
· If the ECN threshold is set too high, the forwarding device uses longer queues and more buffers to ensure the rate of traffic transmission, meeting the bandwidth requirements of large flows. However, during congestion in the queue, packets waiting in the buffer can cause significant queue delay, which is unacceptable for small, latency-sensitive flows.
· If the ECN threshold is set too low, the forwarding device uses shorter queues and fewer buffers to reduce the queuing delay, meeting the latency requirements of small flows. However, a low ECN threshold can reduce network throughput, limiting transmission of large flows.
For these reasons, AI ECN is introduced. AI ECN is an intelligent function that controls the ECN threshold in real time.
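The latency side of this trade-off can be estimated with simple arithmetic: a packet that arrives when the queue holds N bytes waits roughly N × 8 / link rate before it is sent. The following sketch computes that delay. The queue depths and the 100 Gbps link rate are illustrative values, not recommended thresholds.

```python
def queuing_delay_us(queue_depth_bytes: int, link_rate_bps: float) -> float:
    """Time in microseconds needed to drain a queue of the given depth."""
    if link_rate_bps <= 0:
        raise ValueError("link rate must be positive")
    return queue_depth_bytes * 8 / link_rate_bps * 1e6


# Hypothetical thresholds on a 100 Gbps port.
for depth in (100_000, 1_000_000):
    print(depth, "bytes ->", round(queuing_delay_us(depth, 100e9), 3), "us")
```

On a 100 Gbps port, a queue allowed to grow to 1,000,000 bytes before marking adds about 80 microseconds of queuing delay, while a 100,000-byte threshold adds about 8 microseconds but leaves less buffer for large flows.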
AI ECN
As shown in Figure 2, AI ECN uses AI service components on the device or an analyzer to dynamically optimize ECN thresholds according to specific rules. The AI service components, built into network devices or analyzers, are crucial for dynamic ECN optimization. They consist of a three-layer functional framework:
· The data collection and analysis layer provides data collection interfaces for collecting massive amounts of data, and preprocesses and analyzes the collected data.
· The model management layer manages model files and infers the AI ECN threshold based on AI function models loaded by users.
· The algorithm layer invokes the data collection interface to obtain real-time data, and uses the fixed-step search algorithm to calculate the AI ECN threshold.
As shown in Figure 2, AI ECN is implemented as follows:
1. The forwarding chip inside the device collects traffic pattern information, such as queue buffer occupancy, traffic throughput, large and small flow ratio, and other data. Then, it passes real-time traffic characteristics through Telemetry to the data collection and analysis layer of the AI service component.
2. After receiving traffic pattern information, the AI service component analyzes the current traffic pattern through the data collection and analysis layer and determines whether it matches a traffic model in the model management layer.
◦ If a match is found, the AI service component infers the optimal value of the real-time ECN threshold based on the matching traffic model. This method of generating AI ECN thresholds is called model reasoning and uses the neural network algorithm.
◦ If the traffic pattern does not match a traffic model, the AI service component modifies the ECN thresholds by a fixed step based on the current network state while ensuring high bandwidth and low latency. The updated ECN thresholds are then deployed to the forwarding chip. After setting new ECN thresholds, the AI service component adjusts them based on newly collected traffic patterns until the optimal ECN thresholds are obtained. This method of generating AI ECN thresholds is called heuristic reasoning.
3. After AI ECN is enabled on the device, the forwarding chip automatically receives ECN data pushed by the AI service component and sets the ECN threshold to the optimal value issued by the AI service component.
4. The linkage mechanism between the AI service component and the forwarding chip dynamically adjusts the ECN threshold to match the real-time traffic flows.
◦ When a queue has a high proportion of small flows, the ECN triggering threshold is reduced to ensure low latency for small flows.
◦ When a queue has a high proportion of large flows, the ECN triggering threshold is increased to ensure high throughput for large flows.
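The heuristic adjustment in steps 2 and 4 can be pictured as a bounded fixed-step search. The following is only a sketch of that idea. The actual algorithm, its inputs, and its step size are not documented here, so the function name, the use of the small-flow ratio as the only input, and all numeric defaults are assumptions chosen for illustration.

```python
def next_ecn_threshold(
    current_kb: int,
    small_flow_ratio: float,
    *,
    step_kb: int = 10,        # hypothetical fixed step
    min_kb: int = 20,         # hypothetical lower bound
    max_kb: int = 2000,       # hypothetical upper bound
    low_ratio: float = 0.3,   # below this, large flows dominate
    high_ratio: float = 0.7,  # above this, small flows dominate
) -> int:
    """Move the ECN threshold one fixed step toward the traffic mix."""
    if small_flow_ratio >= high_ratio:
        candidate = current_kb - step_kb  # favor low latency
    elif small_flow_ratio <= low_ratio:
        candidate = current_kb + step_kb  # favor high throughput
    else:
        candidate = current_kb            # mixed traffic: keep the threshold
    return max(min_kb, min(max_kb, candidate))


threshold = 100
for ratio in (0.9, 0.9, 0.5, 0.1):
    threshold = next_ecn_threshold(threshold, ratio)
    print(ratio, "->", threshold, "KB")
```

Repeating the call with each newly collected traffic sample moves the threshold gradually, which mirrors the iterative adjustment described for heuristic reasoning.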
Licensing requirements
The AI ECN feature requires licenses. To use the AI ECN feature, first install the corresponding licenses. For more information, see license management in Fundamentals Configuration Guide.
AI ECN configuration tasks at a glance
To configure AI ECN, perform the following tasks:
1. Enabling AI ECN for a queue
Enabling AI ECN for a queue
About this task
This function enables the device to collect and send traffic characteristics to the AI service component on an analyzer or the local AI service component by using NetAnalysis. The AI service component dynamically sets the optimal ECN triggering threshold for a queue to achieve low delay and high throughput. For more information about NetAnalysis, see Network Management and Monitoring Configuration Guide.
The following AI ECN modes are supported based on the chip and hardware capabilities:
· Centralized mode—The analyzer calculates the ECN triggering threshold and communicates it to devices. This mode does not require high hardware capabilities for devices.
· Distributed mode—The device itself intelligently sets the optimal ECN triggering threshold for queues. This mode requires high hardware capabilities for devices and consumes device CPU resources.
· Neural mode—The neural network algorithm of the device intelligently sets the optimal ECN triggering threshold for queues. This mode requires the device chip to support the neural network algorithm.
Restrictions and guidelines
This feature is mutually exclusive with the following settings:
· Applying a WRED table to an interface.
· Configuring WRED parameters for a queue.
· Setting the WRED exponent for average queue size calculation.
· Enabling ECN for a queue.
· Enabling global WRED Smart ECN.
For more information about WRED, see QoS in ACL and QoS Configuration Guide.
Prerequisites
To configure AI ECN in an intelligent lossless network, first configure NetAnalysis for RoCEv2 traffic:
· Use the netanalysis rocev2 mode command to set the mode of RoCEv2 traffic analysis.
· Use the netanalysis rocev2 statistics command to enable RoCEv2 traffic statistics collection.
· Use the netanalysis rocev2 ai-ecn enable command to enable AI ECN for RoCEv2 traffic statistics collection.
Procedure
1. Enter system view.
system-view
2. Enter AI service view.
ai-service
3. Enable AI ECN and set the AI ECN mode.
ai ai-ecn enable mode { centralized | distributed | neural }
4. Enter AI ECN view.
ai-ecn
5. Enable AI ECN for a queue.
queue queue-id enable
By default, AI ECN is disabled for a queue.
Saving the AI ECN log file
About this task
After you enable AI ECN for a queue and enable the AI ECN log file feature, the device logs operations for adjusting the queue's optimal ECN threshold and the basis for the adjustment (the data flow preprocessing result) into the AI ECN log file. The device automatically saves this log file in its local storage. The log file that AI ECN automatically saves typically contains the AIECN string as an identifier.
The AI ECN log file can help operations and technical support analyze the effectiveness of AI ECN.
Procedure
Execute this command in any view to save the AI ECN log file:
ai ai-ecn save logfile
Display and maintenance commands for AI ECN
Execute display commands in any view.
Task | Command |
---|---|
Display information in the AI ECN log file. | display ai ai-ecn logfile [ tail line-number ] |
Configuring AI for log aggregation and root cause analysis
About AI for log aggregation and root cause analysis
Overview
The information center can receive log information generated by all modules and classify and manage the log information according to modules and log levels. However, users still cannot quickly find key information from massive log information or perform accurate and efficient fault location based on the key log information.
This function can implement the following:
· The AI process aggregates all log information received by the information center over a period of time based on log correlation. It generates a summary file according to the log aggregation result, simplifying log information. For example, multiple logs from a main interface and its subinterfaces caused by a main interface fault will be aggregated into one log summary.
· The AI process derives potential fault causes from aggregated log results and fault root cause model files in devices, and exports root cause analysis files as references for users.
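The main interface and subinterface example above can be sketched as grouping logs by time window and parent interface. The correlation rules that the AI process actually uses are not documented here, so this is only an illustration of the idea. The tuple layout, the 60-second window, and the function names are assumptions.

```python
from collections import defaultdict


def parent_interface(name: str) -> str:
    """Map a subinterface such as Ten-GigabitEthernet5/2/2.1 to its main interface."""
    return name.split(".", 1)[0]


def aggregate(logs, window_s: int = 60):
    """Group (timestamp, interface, message) logs by window and main interface."""
    groups = defaultdict(list)
    for timestamp, interface, message in logs:
        key = (timestamp // window_s, parent_interface(interface))
        groups[key].append(message)
    return dict(groups)


logs = [
    (1200, "Ten-GigabitEthernet5/2/2", "physical down"),
    (1201, "Ten-GigabitEthernet5/2/2.1", "physical down"),
    (1202, "Ten-GigabitEthernet5/2/2.2", "physical down"),
    (1203, "Ten-GigabitEthernet4/0/5", "link down"),
]
for (window, interface), messages in aggregate(logs).items():
    print(window, interface, len(messages))
```

In this sketch the three logs that relate to Ten-GigabitEthernet5/2/2 fall into one group, which corresponds to one log summary with three aggregated entries.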
Data source
The data source for AI log aggregation and root cause analysis is the log information generated by this device. This function obtains log information from the information center of the device.
After you enable the information center on the device and configure local loopback address 127.0.0.1 as the log host address, the AI process listens for log information from the information center on local UDP port 514.
Format of the log summary file
Fields in the log summary file
You can use the display ai ai-fault-analysis summary command to view information in the log summary file. Table 1 shows the meaning for each field in the file.
Table 1 Fields in the log summary file
Field | Description | Example | Remarks |
---|---|---|---|
Time window | The time span of the log summary. | 2021-03-08 22:19:00 to 22:19:59 | Expressed in the form of year, month, day, hours, minutes, and seconds. |
Device address | Device IP address. | 77.1.1.4 | The device information consists of the device address and device name. It follows the word device. |
Device name | Device name corresponding to the log summary. | (S7503E) | |
Log event | Summary information of the log. | Transceiver on Ten-GigabitEthernet1/2/0/23 is NOT sold by H3C. | The content of the log event follows the string encountered the following events:. |
Impact | Potential impact of all logs within the time window. | Impact: Flow stability on Ten-GigabitEthernet1/2/0/23. | · When there is no impact, this field is not displayed. · No impact entry number is provided if only one impact entry exists. · Impact entry numbers are provided if two or more impact entries exist. |
Aggregated entries | The total number of logs aggregated for the log summary. | Aggregated entries: 3. | The number of logs included in the log summary. |
Highest severity level | The highest severity level of aggregated logs. | Highest severity level: Error. | |
Example log summary file
The following displays a sample of a log summary file:
· There is only one log event.
2021-03-08 22:15:00 to 22:15:59, device 80.0.0.13(16X-B) encountered the following events: Interface Ten-GigabitEthernet11/2/1 link down. Aggregated entries: 7. Highest severity level: Error.
· There are multiple log events.
2021-03-08 22:17:00 to 22:17:59, device 80.0.0.13(16X-B) encountered the following events: 1. Interface Ten-GigabitEthernet5/2/2.2 physical down. 2. Interface Ten-GigabitEthernet5/2/2.1 physical down. 3. Interface Ten-GigabitEthernet5/2/2.3 physical down. Aggregated entries: 3. Highest severity level: Error.
· No impact.
2021-03-08 22:22:00 to 22:22:59, device 192.28.200.201(12508 W) encountered the following events: Many MAC addresses moved from port Ten-GigabitEthernet4/0/5:1 to port Ten-GigabitEthernet4/0/5:2. Aggregated entries: 5. Highest severity level: Warning.
· There is only one impact entry.
2021-03-08 22:19:00 to 22:19:59, device 77.1.1.4(S7503E) encountered the following events: 1. Transceiver on Ten-GigabitEthernet1/2/0/23 is NOT sold by H3C. 2. Transient physical state change on interface Tunnel4. 3. Interface Tunnel4 physical up. Impact: Flow stability on Ten-GigabitEthernet1/2/0/23. Aggregated entries: 22. Highest severity level: Error.
· There are multiple impact entries.
2021-03-08 22:21:00 to 22:21:59, device 192.28.200.201(12508 W) encountered the following events: 1. Fan 1 failed. 2. Slot 2 rebooting. Impact: 1. IS-IS neighbor down. 2. OSPF neighbor down. 3. OSPFv3 neighbor down. Aggregated entries: 9. Highest severity level: Critical.
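When summary files are processed offline, each line can be split into the fields of Table 1 with a regular expression. The following sketch is based only on the five sample lines above and might not cover every variation that a device produces. The parse_summary and split_entries functions are hypothetical helpers.

```python
import re

SUMMARY_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<start>\d{2}:\d{2}:\d{2})"
    r" to (?P<end>\d{2}:\d{2}:\d{2}), "
    r"device (?P<address>\S+?)\((?P<name>[^)]*)\) "
    r"encountered the following events: (?P<events>.*?)"
    r"(?: Impact: (?P<impact>.*?))?"
    r" Aggregated entries: (?P<count>\d+)\. "
    r"Highest severity level: (?P<severity>\w+)\.$"
)


def split_entries(text: str) -> list:
    """Split '1. A. 2. B.' into ['A.', 'B.']; keep unnumbered text whole."""
    parts = re.split(r"(?:^|(?<=\. ))\d+\. ", text.strip())
    return [p.strip() for p in parts if p.strip()]


def parse_summary(line: str) -> dict:
    m = SUMMARY_RE.match(line.strip())
    if m is None:
        raise ValueError("not a log summary line")
    return {
        "window": (f"{m['date']} {m['start']}", f"{m['date']} {m['end']}"),
        "address": m["address"],
        "name": m["name"],
        "events": split_entries(m["events"]),
        "impact": split_entries(m["impact"] or ""),
        "aggregated": int(m["count"]),
        "severity": m["severity"],
    }


sample = (
    "2021-03-08 22:21:00 to 22:21:59, device 192.28.200.201(12508 W) "
    "encountered the following events: 1. Fan 1 failed. 2. Slot 2 rebooting. "
    "Impact: 1. IS-IS neighbor down. 2. OSPF neighbor down. "
    "3. OSPFv3 neighbor down. Aggregated entries: 9. "
    "Highest severity level: Critical."
)
summary = parse_summary(sample)
print(summary["events"], summary["impact"], summary["severity"])
```

A summary without an Impact clause yields an empty impact list.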
Format of the root cause analysis file
Fields in the file
You can use the display ai ai-fault-analysis root-cause command to view the root cause analysis file. Table 2 shows the meaning of each field in the file.
Table 2 Fields in the root cause analysis file
Field | Description | Example | Remarks |
---|---|---|---|
Time | Fault occurrence time. | 2021-03-08 22:15:00 | Expressed in year, month, day, hours, minutes, and seconds. |
Fault description | Summary information of a fault alarm. | fault BFD_Session_Down occurred on (device=77.1.1.41, session=21.2.1.1/21.2.1.2). | The information includes a brief description of the fault, the address of the faulty device, and service information. |
Fixed description field | One of two fixed strings, depending on whether a suspected root cause exists: · root issue [severity=xxxx], no causes found · Possible root cause | Possible root cause: | There might be faults that AI processes are unable to detect. |
Severity level | The severity level of a fault or the severity level of the log indicating the root cause of the fault. | severity=Error | If there is no root cause, the severity level of the fault is displayed. If there is a suspected root cause, the severity level of the log that indicates the root cause is displayed. |
Probability of the suspected root cause | Probability that the logged event is the root cause of the fault. | probability=0.8 | This information is not displayed when there is no root cause. |
Time of the suspected root cause | A logged event might be the root cause of the fault. This information is the time when the event was logged. | 2021-03-08 22:21:52 | This information is not displayed when there is no root cause. |
Device address for the root cause of the fault | A logged event might be the root cause of the fault. This information is the IP address of the device where the event was logged. | 192.28.200.201 | The device information consists of the device address and device name. It follows the word device. |
Device name for the root cause of the fault | A logged event might be the root cause of the fault. This information is the name of the device where the event was logged. | (12508 W) | |
Network elements involved in the root cause of the fault | The specific interface or slot information related to the root cause of the fault. | (device=192.28.200.201, mdc=1, chassis=0, slot=2) | The network elements follow the string encountered xxxx on. |
Details | A detailed description of the fault cause or the fault itself. | Details: Board is rebooting on slot 2. | · If there is no root cause, the detailed description of the fault is displayed. · If there is a suspected root cause, the detailed description of the root cause is displayed. |
Output sample
· No root cause:
2021-03-09 10:43:30, fault BFD_Session_Down occurred on (device=77.1.1.41, session=21.2.1.1/21.2.1.2). root issue [severity=Notification], no causes found. Details: Sess[21.2.1.1/21.2.1.2, LD/RD:4128/4119, Interface:Vlan202, SessType:Ctrl, LinkType:INET], Ver:1, Sta: UP->DOWN, Diag: 3 (No Diagnostic)
· There is only one root cause.
2021-03-08 22:21:43, fault OSPF_Neighbor_Down occurred on (device=192.28.200.201, route=OSPF, ospfId=1). Possible root cause: [severity=Notification, probability=0.8] 2021-03-08 22:21:52, device 192.28.200.201(12508 W) encountered DEV_BOARD_REBOOT on (device=192.28.200.201, mdc=1, chassis=0, slot=2). Details: Board is rebooting on slot 2.
· There are multiple root causes.
2021-03-08 22:20:30, fault Interface_Flapping occurred on (device=77.1.1.4, mdc=1, port=Tunnel4). Possible root causes: 1. [severity=Error, probability=1.0] 2021-03-08 22:20:30, device 77.1.1.4(S7503E) encountered IFNET_TNL_PHY_UPDOWN_DERIVE on (device=77.1.1.4, mdc=1, port=Tunnel4). Details: Transient physical state change between up and down occurred on interface Tunnel4. 2. [severity=Warning, probability=0.7] 2021-03-08 22:19:26, device 77.1.1.4(S7503E) encountered OPTMOD_PHONY_MODULE on (device=77.1.1.4, mdc=1, chassis=1, slot=2, port=Ten-GigabitEthernet1/2/0/23). Details: Ten-GigabitEthernet1/2/0/23: This transceiver is NOT sold by H3C. H3C therefore shall NOT guarantee the normal function of the device or assume the maintenance responsibility thereof!
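To rank suspected root causes across many records, the severity and probability pairs can be extracted with a regular expression. The following sketch relies only on the layout of the three samples above, and the suspected_causes function is a hypothetical helper. A record with no root cause contains no probability value, so it yields an empty list.

```python
import re

CAUSE_RE = re.compile(
    r"\[severity=(?P<severity>\w+), probability=(?P<probability>\d+(?:\.\d+)?)\]\s+"
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}), "
    r"device (?P<address>\S+?)\((?P<name>[^)]*)\) "
    r"encountered (?P<event>\S+) on"
)


def suspected_causes(record: str) -> list:
    """Return suspected root causes, most probable first."""
    causes = [
        {
            "severity": m["severity"],
            "probability": float(m["probability"]),
            "time": m["time"],
            "address": m["address"],
            "name": m["name"],
            "event": m["event"],
        }
        for m in CAUSE_RE.finditer(record)
    ]
    return sorted(causes, key=lambda c: c["probability"], reverse=True)


ONE_CAUSE = (
    "2021-03-08 22:21:43, fault OSPF_Neighbor_Down occurred on "
    "(device=192.28.200.201, route=OSPF, ospfId=1). Possible root cause: "
    "[severity=Notification, probability=0.8] 2021-03-08 22:21:52, "
    "device 192.28.200.201(12508 W) encountered DEV_BOARD_REBOOT on "
    "(device=192.28.200.201, mdc=1, chassis=0, slot=2). "
    "Details: Board is rebooting on slot 2."
)
for cause in suspected_causes(ONE_CAUSE):
    print(cause["event"], cause["probability"], cause["severity"])
```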
AI for log aggregation and root cause analysis tasks at a glance
To configure AI for log aggregation and root cause analysis, perform the following tasks:
1. Enabling AI for log aggregation and root cause analysis
2. Saving the log aggregation summary file or root cause analysis file
Prerequisites
Enable the information center and configure local loopback address 127.0.0.1 as the log host address. UDP port 514 is used as the listening port and cannot be modified.
Enabling AI for log aggregation and root cause analysis
1. Enter system view.
system-view
2. Enter AI service view.
ai-service
3. Enable AI for log aggregation and root cause analysis.
ai ai-fault-analysis enable
By default, AI for log aggregation and root cause analysis is disabled.
Saving the log aggregation summary file or root cause analysis file
To save the log aggregation summary file, root cause analysis file, or log file, execute the following command in any view:
ai ai-fault-analysis save { logfile | root-cause | summary }
By default, the system does not save the log aggregation summary file or root cause analysis file.
Display and maintenance commands for AI for log aggregation and root cause analysis
Execute display commands in any view.
Task |
Command |
Display information in the log file, root cause analysis file, or log aggregation summary file. |
display ai ai-fault-analysis { logfile | root-cause | summary } [ tail line-number ] |
Configuring AI for device anomaly detection
About AI for device anomaly detection
Overview
When managing and maintaining devices, you can manually configure alarm thresholds for different resources. For example, you can execute the resource-monitor resource command to configure alarm thresholds for CPU or ARP resources. The system will generate alarm messages when the specified thresholds are reached. However, these types of alarm messages cannot reflect changes in resource usage trends. Additionally, misconfiguring alarm thresholds can interfere with fault diagnosis and hinder the development of future AIOps methods.
This function uses AI algorithms to infer whether the usage of resources and table entries on the device is abnormal, and generates alarm messages when it detects anomalies. Because the detection follows the actual usage trend instead of a fixed value, it is more accurate than alarms triggered by static alarm thresholds.
Operating principle
This function supports monitoring the CPU usage and usage of various table resources. You can use the display resource-monitor command to view resource monitoring information.
The data collected varies based on the device model. This function uses a maximum of three different algorithms to dynamically detect anomalies in resource usage for the current performance metrics. The main operating mechanism includes:
1. Data collection: The AI anomaly detection process collects resource usage data of different indicators on the device through NETCONF. Each data packet contains key information such as device information, timestamp, and resource usage of indicators.
2. First round of AI algorithm detection: The AI anomaly detection process learns from historical indicators collected from the device to analyze their change trend and infer the dynamic threshold range model. The process can also directly load AI model files to generate the dynamic threshold range model for data indicators. The process then checks real-time data against the model. If the collected data falls within the model's threshold range, there is no anomaly. Otherwise, the data is marked as abnormal. The process automatically selects the appropriate first-round algorithm based on the fluctuation and trend characteristics of different indicators. The blue region in Figure 3 indicates the dynamic threshold range for normal indicators.
Figure 3 Algorithm for the first round of anomaly detection
3. Second round of AI algorithm detection: Indicators flagged as anomalies by the first round undergo a second round of anomaly detection. This round is based on a voting mechanism that uses multiple unsupervised algorithms. The algorithms make the anomaly decision based on the statistical characteristics of the data. Indicators identified as anomalies might require a third round of AI algorithm detection depending on the device and type of indicators.
4. Third round of AI algorithm detection: This round uses an unsupervised deep learning algorithm (deepar) that runs on the device chip.
5. The AI anomaly detection process generates an alarm for abnormal usage only when all rounds of anomaly detection show abnormal results. Using multiple rounds of AI algorithm detection improves accuracy and intelligence of anomaly detection.
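The multi-round mechanism can be illustrated with a small sketch. The algorithms that the device actually uses are not documented here, so the detectors below are common statistical stand-ins and all parameters are assumptions: the first round uses a mean ± k standard deviations band as the dynamic threshold range, and the second round is a majority vote among three simple unsupervised checks. The third, deep learning round is omitted.

```python
import statistics


def outside_dynamic_band(history, value, k=3.0):
    """Round 1 (stand-in): value is outside mean +/- k * standard deviation."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return abs(value - mean) > k * std


def iqr_outlier(history, value, factor=1.5):
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - factor * iqr or value > q3 + factor * iqr


def mad_outlier(history, value, limit=3.5):
    median = statistics.median(history)
    mad = statistics.median(abs(x - median) for x in history)
    if mad == 0:
        return value != median
    return 0.6745 * abs(value - median) / mad > limit


def range_outlier(history, value, margin=0.10):
    low, high = min(history), max(history)
    span = high - low
    return value < low - margin * span or value > high + margin * span


def is_anomalous(history, value):
    """Raise an alarm only if every round reports an anomaly."""
    if not outside_dynamic_band(history, value):
        return False
    votes = [
        iqr_outlier(history, value),
        mad_outlier(history, value),
        range_outlier(history, value),
    ]
    return sum(votes) >= 2  # round 2 (stand-in): majority vote


cpu_history = [30, 32, 31, 29, 33, 30, 31, 32, 30, 31]  # CPU usage in percent
for sample in (31, 35, 90):
    print(sample, is_anomalous(cpu_history, sample))
```

In this example, a CPU usage of 35 percent leaves the first-round band but is rejected by the vote, so no alarm is generated. This shows how the later rounds suppress false positives from the first round.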
AI for device anomaly detection tasks at a glance
To configure AI for device anomaly detection, perform the following tasks:
1. Enabling AI for device anomaly detection
2. Saving the log file for AI-based device anomaly detection
Enabling AI for device anomaly detection
1. Enter system view.
system-view
2. Enter AI service view.
ai-service
3. Enable AI for device anomaly detection.
ai key-resource-monitor enable
By default, AI for device anomaly detection is disabled.
Saving the log file for AI-based device anomaly detection
About this task
The log file for AI-based device anomaly detection can help operations and technical support analyze the effectiveness of the AI-based device anomaly detection feature. The log for AI-based device anomaly detection contains usage information of various metrics collected and the basis for multiple rounds of AI algorithm decisions.
Procedure
Execute this command in any view to save the log file for the AI-based device anomaly detection feature.
ai key-resource-monitor save logfile
By default, the system does not automatically save the log file for the AI-based device anomaly detection feature.
Display and maintenance commands for AI-based device anomaly detection
Execute display commands in any view.
Task | Command |
---|---|
Display the running log for AI-based device anomaly detection. | display ai key-resource-monitor logfile [ tail line-number ] |