19-AIOps Configuration Guide

Configuring AIOps

AIOps overview

AIOps is artificial intelligence for IT operations. AI is profoundly transforming human society at an unprecedented pace. Various ICT devices are also actively utilizing AI technology to improve device operation efficiency and enhance traditional maintenance methods, thereby meeting user demands for cost reduction and efficiency improvement. The three core elements of AI are algorithms, computing power, and data. Devices utilize various AI algorithms and massive amounts of sample data to achieve different AI functions through the computing power of chips on the device.

AIOps functions

AI ECN

AI for Explicit Congestion Notification (AI ECN) uses AI algorithms and data models to dynamically predict the optimal ECN threshold for the queue. Devices transmit packets with ECN markers based on the optimized threshold to reduce network congestion. This ensures low latency and high throughput in complex network environments.

AI for log aggregation and root cause analysis

This function aggregates all the log information received by the information center within a certain period of time based on the correlation between the logs. Based on the log aggregation results and the root cause analysis library, it infers the possible causes of faults and outputs a root cause analysis file for users to reference. For more information about this function, see "Configuring AI for log aggregation and root cause analysis."

AI for device anomaly detection

This function uses algorithms for anomaly detection on CPU usage or usage of other resources. It outputs alarm messages based on the detected anomaly. Compared to manually configured usage alarm thresholds, this function can detect anomalies more accurately. For more information about this function, see "Configuring AI for device anomaly detection."

AI models

AI functions often require training on extensive laboratory data to generate a reliable algorithm model. You can copy the algorithm model file directly to the device after downloading it from the specified official website, or use the analyzer to push it to the device through Telemetry.

After you load a model file to the device, the enabled AI services automatically read the model file based on its filename. They then infer the optimal configuration values by using the algorithm model file, thereby implementing the AI functionalities.

License requirements for AI model file loading

The AI model file loading feature is limited by licenses. To use this feature, first install the corresponding licenses. For more information, see license management in Fundamentals Configuration Guide.

Managing AI model files

Restrictions and guidelines

To ensure that different AI functions can correctly identify their respective model files, the filenames obtained from the official website or analyzer follow certain naming conventions. As a best practice, do not manually change the filenames. Doing so might prevent an AI function from recognizing its model file.

Take a model file named h3caiecn-1-dqn-001 as an example:

·     The aiecn string represents the process name associated with the AI function.

·     A value of 1 indicates that the model file is generated through analyzer cloud pushing, while a value of 0 means that the file is manually downloaded from the official website.

·     The dqn string denotes the name of the algorithm and other descriptions used by the AI function. An AI function can be implemented based on various AI algorithms.

·     The 001 string indicates the version number.
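
The naming convention above can be illustrated with a short Python sketch that splits a model filename into its parts. This is only an illustration, not a tool shipped with the device. The class name ModelFileName, the function name parse_model_filename, and the assumption that h3c is a fixed prefix in front of the process name are hypothetical and are not stated in this guide.

```python
from dataclasses import dataclass


@dataclass
class ModelFileName:
    process: str              # process name of the AI function, for example "aiecn"
    pushed_by_analyzer: bool  # True for 1 (analyzer cloud pushing), False for 0 (manual download)
    algorithm: str            # algorithm name and other descriptions, for example "dqn"
    version: str              # version number, for example "001"


def parse_model_filename(name: str, prefix: str = "h3c") -> ModelFileName:
    """Split a filename such as h3caiecn-1-dqn-001 into its fields.

    Assumption: the process name is preceded by a fixed prefix ("h3c").
    """
    parts = name.split("-")
    if len(parts) < 4:
        raise ValueError(f"unexpected model filename: {name!r}")
    # The algorithm part may carry extra descriptions, so keep everything
    # between the source flag and the version together.
    head, source, *algorithm, version = parts
    if not head.startswith(prefix) or len(head) == len(prefix):
        raise ValueError(f"missing process name in: {name!r}")
    if source not in ("0", "1"):
        raise ValueError(f"unexpected source flag in: {name!r}")
    return ModelFileName(
        process=head[len(prefix):],
        pushed_by_analyzer=(source == "1"),
        algorithm="-".join(algorithm),
        version=version,
    )


info = parse_model_filename("h3caiecn-1-dqn-001")
print(info.process, info.pushed_by_analyzer, info.algorithm, info.version)
```

A check of this kind can help confirm that a file was not renamed before you load it with the model load command.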

If the currently loaded AI model file is not applicable to the device's live environment, you can unload the model file by using a command.

Procedure

1.     Enter system view.

system-view

2.     Enter AI service view.

ai-service

3.     Load an AI model file.

model load file-name

By default, no AI model files exist.

To load an AI model file, first save the AI model file in the device's storage space.

4.     (Optional.) Unload an AI model file.

model unload file-name

Display and maintenance commands for AI models

Execute display commands in any view.

 

Task | Command
Display information about the loaded AI model files. | display ai-service model

 


Configuring AI for log aggregation and root cause analysis

About AI for log aggregation and root cause analysis

Overview

The information center can receive log information generated by all modules and classify and manage the log information according to modules and log levels. However, users still cannot quickly find key information from massive log information or perform accurate and efficient fault location based on the key log information.

This function can implement the following:

·     The AI process aggregates all log information received by the information center over a period of time based on log correlation. It generates a summary file according to the log aggregation result, simplifying log information. For example, multiple logs from a main interface and its subinterfaces caused by a main interface fault will be aggregated into one log summary.

·     The AI process derives potential fault causes from the log aggregation results and the fault root cause model files on the device, and exports root cause analysis files for users to reference.

Data source

The data source for AI log aggregation and root cause analysis is the log information generated by this device. This function obtains log information from the information center of the device.

After you enable the information center on the device and configure local loopback address 127.0.0.1 as the log host address, the AI process listens for log information from the information center on local UDP port 514.
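
To make the data path concrete, the following Python sketch shows what receiving one log message over UDP on the loopback address looks like in general. It is not the AI process itself and does not run on the device. The function names open_log_listener and receive_log are hypothetical. Binding to port 514 usually requires elevated privileges on a general-purpose host, so the port is a parameter.

```python
import socket


def open_log_listener(port: int = 514, host: str = "127.0.0.1") -> socket.socket:
    """Open a UDP socket bound to the loopback address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_log(sock: socket.socket, timeout: float = 5.0, max_size: int = 65535) -> str:
    """Wait for one datagram and return its payload as text."""
    sock.settimeout(timeout)
    data, _sender = sock.recvfrom(max_size)
    return data.decode("utf-8", errors="replace")
```

Because the listener is bound to 127.0.0.1, it only sees logs that the local information center sends to the loopback log host, which matches the data source described above.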

Format of the log summary file

Fields in the log summary file

You can use the display ai ai-fault-analysis summary command to view information in the log summary file. Table 1 shows the meaning for each field in the file.

Table 1 Fields in the log summary file

Field | Description | Example | Remarks
Time window | The time span of the log summary. | 2021-03-08 22:19:00 to 22:19:59 | Expressed in the form of year, month, day, hours, minutes, and seconds.
Device address | Device IP address. | 77.1.1.4 | The device information includes the device address and device name. It is located after the keyword device.
Device name | Device name corresponding to the log summary. | (S7503E) | The device information includes the device address and device name. It is located after the keyword device.
Log event | Summary information of the log. | Transceiver on Ten-GigabitEthernet1/2/0/23 is NOT sold by H3C. | The content of the log event comes after the prompt "encountered the following events:".
Impact | Potential impact of all logs within the time window. | Impact: Flow stability on Ten-GigabitEthernet1/2/0/23. | When there is no impact, this field is empty. No impact entry number is provided if only one impact entry exists. Impact entry numbers are provided if two or more impact entries exist.
Aggregated entries | The total number of logs aggregated for the log summary. | Aggregated entries: 3. | The number of logs included in the log summary.
Highest severity level | The highest severity level of aggregated logs. | The highest severity level: Error. | The highest severity level of aggregated logs.

Example log summary file

The following are sample entries from a log summary file:

·     There is only one log event.

2021-03-08 22:15:00 to 22:15:59, device 80.0.0.13(16X-B) encountered the following events: Interface Ten-GigabitEthernet11/2/1 link down. Aggregated entries: 7. Highest severity level: Error.

·     There are multiple log events.

2021-03-08 22:17:00 to 22:17:59, device 80.0.0.13(16X-B) encountered the following events: 1. Interface Ten-GigabitEthernet5/2/2.2 physical down. 2. Interface Ten-GigabitEthernet5/2/2.1 physical down. 3. Interface Ten-GigabitEthernet5/2/2.3 physical down. Aggregated entries: 3. Highest severity level: Error.

·     No impact.

2021-03-08 22:22:00 to 22:22:59, device 192.28.200.201(12508 W) encountered the following events: Many MAC addresses moved from port Ten-GigabitEthernet4/0/5:1 to port Ten-GigabitEthernet4/0/5:2. Aggregated entries: 5. Highest severity level: Warning.

·     There is only one impact entry.

2021-03-08 22:19:00 to 22:19:59, device 77.1.1.4(S7503E) encountered the following events: 1. Transceiver on Ten-GigabitEthernet1/2/0/23 is NOT sold by H3C. 2. Transient physical state change on interface Tunnel4. 3. Interface Tunnel4 physical up. Impact: Flow stability on Ten-GigabitEthernet1/2/0/23. Aggregated entries: 22. Highest severity level: Error.

·     There are multiple impact entries.

2021-03-08 22:21:00 to 22:21:59, device 192.28.200.201(12508 W) encountered the following events: 1. Fan 1 failed. 2. Slot 2 rebooting. Impact: 1. IS-IS neighbor down. 2. OSPF neighbor down. 3. OSPFv3 neighbor down. Aggregated entries: 9. Highest severity level: Critical.
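
The sample entries above follow a regular layout, so they can be split into the fields of Table 1 with a regular expression. The Python sketch below is derived only from the samples in this section. It assumes that every entry has the same layout as these samples, and the names SUMMARY_RE, split_items, and parse_summary are hypothetical.

```python
import re

SUMMARY_RE = re.compile(
    r"^(?P<start>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) to (?P<end>\d{2}:\d{2}:\d{2}), "
    r"device (?P<address>\S+?)\((?P<name>[^)]*)\) encountered the following events: "
    r"(?P<events>.*?)"
    r"(?: Impact: (?P<impact>.*?))?"
    r" Aggregated entries: (?P<count>\d+)\. "
    r"Highest severity level: (?P<severity>\w+)\.$"
)


def split_items(text):
    """Split '1. A. 2. B.' into ['A.', 'B.']; unnumbered text stays one item."""
    if not text:
        return []
    # An entry number appears at the start or right after the previous sentence.
    parts = re.split(r"(?:^|(?<=\. ))\d+\. ", text.strip())
    return [p.strip() for p in parts if p.strip()]


def parse_summary(line):
    """Return the fields of one log summary entry as a dictionary."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        raise ValueError("line does not look like a log summary entry")
    return {
        "time_window": (match["start"], match["end"]),
        "device_address": match["address"],
        "device_name": match["name"],
        "events": split_items(match["events"]),
        "impact": split_items(match["impact"]),
        "aggregated_entries": int(match["count"]),
        "highest_severity": match["severity"],
    }
```

For an entry without an Impact field, the impact list is empty. For an entry with a single unnumbered impact, the list contains one item.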

Format of the root cause analysis file

Fields in the file

You can use the display ai ai-fault-analysis root-cause command to view the root cause analysis file. Table 2 shows the meaning of each field in the file.

Table 2 Fields in the file

Field | Description | Example | Remarks
Time | Fault occurrence time. | "2021-03-08 22:15:00" | Expressed in year, month, day, hours, minutes, and seconds.
Fault description | Summary information of a fault alarm. | fault BFD_Session_Down occurred on (device=77.1.1.41, session=21.2.1.1/21.2.1.2). | The information includes a brief description of the fault, the address of the faulty device, and service information.
Fixed description field | One of two fixed description fields is displayed, depending on whether a suspected root cause exists: "root issue no causes found" or "Possible root cause". | Possible root cause: | There might be faults that AI processes are unable to detect.
Severity level | The severity level of a fault or the severity level of the log indicating the root cause of the fault. | severity=Error | If there is no root cause, the severity level of the fault is displayed. If there is a suspected root cause, the severity level of the fault's log is displayed.
Possibility of the suspected root cause | Possibility of the suspected root cause. | probability=0.8 | This information is not displayed when there is no root cause.
Time when the fault's root cause occurred | A logged event might be the root cause of the fault. This information is the time when the event was logged. | 2021/3/8 22:21:52 | This information is not displayed when there is no root cause.
Device address for the fault | A logged event might be the root cause of the fault. This information is the IP address of the device where the event was logged. | 192.28.200.201 | The device information includes the device address and device name. It is located after the keyword device.
Device name for the root cause of the fault | A logged event might be the root cause of the fault. This information is the name of the device where the event was logged. | (12508 W) | The device information includes the device address and device name. It is located after the keyword device.
Network elements involved in the root cause of the fault | The specific interface or slot information related to the root cause of the fault. | (device=192.28.200.201, mdc=1, chassis=0, slot=2) | The network elements are located after "encountered xxxx on".
Details | A detailed description of the fault cause or the fault itself. | Details: Board is rebooting on slot 2. | The description depends on whether a root cause exists. If there is no root cause, the detailed description of the fault is displayed. If there is a suspected root cause, the detailed description of the root cause is displayed.

Output sample

·     No root cause:

2021-03-09 10:43:30, fault BFD_Session_Down occurred on (device=77.1.1.41, session=21.2.1.1/21.2.1.2). root issue [severity=Notification], no causes found. Details: Sess[21.2.1.1/21.2.1.2, LD/RD:4128/4119, Interface:Vlan202, SessType:Ctrl, LinkType:INET], Ver:1, Sta: UP->DOWN, Diag: 3 (No Diagnostic)

·     There is only one root cause.

2021-03-08 22:21:43, fault OSPF_Neighbor_Down occurred on (device=192.28.200.201, route=OSPF, ospfId=1). Possible root cause: [severity=Notification, probability=0.8] 2021-03-08 22:21:52, device 192.28.200.201(12508 W) encountered DEV_BOARD_REBOOT on (device=192.28.200.201, mdc=1, chassis=0, slot=2). Details: Board is rebooting on slot 2.

·     There are multiple root causes.

2021-03-08 22:20:30, fault Interface_Flapping occurred on (device=77.1.1.4, mdc=1, port=Tunnel4). Possible root causes: 1. [severity=Error, probability=1.0] 2021-03-08 22:20:30, device 77.1.1.4(S7503E) encountered IFNET_TNL_PHY_UPDOWN_DERIVE on (device=77.1.1.4, mdc=1, port=Tunnel4). Details: Transient physical state change between up and down occurred on interface Tunnel4. 2. [severity=Warning, probability=0.7] 2021-03-08 22:19:26, device 77.1.1.4(S7503E) encountered OPTMOD_PHONY_MODULE on (device=77.1.1.4, mdc=1, chassis=1, slot=2, port=Ten-GigabitEthernet1/2/0/23). Details: Ten-GigabitEthernet1/2/0/23: This transceiver is NOT sold by H3C. H3C therefore shall NOT guarantee the normal function of the device or assume the maintenance responsibility thereof!
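
The output samples above can be broken down into the fields of Table 2 in a similar way. The following Python sketch extracts the fault header and each suspected root cause. It is based only on the three samples in this section and assumes other entries share their layout. The names HEADER_RE, CAUSE_RE, parse_pairs, and parse_root_cause are hypothetical.

```python
import re

HEADER_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}), "
    r"fault (?P<fault>\w+) occurred on \((?P<location>[^)]*)\)\."
)

CAUSE_RE = re.compile(
    r"\[severity=(?P<severity>\w+), probability=(?P<probability>[\d.]+)\] "
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}), "
    r"device (?P<address>\S+?)\((?P<name>[^)]*)\) "
    r"encountered (?P<event>\w+) on \((?P<elements>[^)]*)\)\."
)


def parse_pairs(text):
    """Turn 'device=1.1.1.1, slot=2' into {'device': '1.1.1.1', 'slot': '2'}."""
    return dict(item.split("=", 1) for item in text.split(", "))


def parse_root_cause(entry):
    """Return the fault header and the list of suspected root causes."""
    header = HEADER_RE.match(entry.strip())
    if header is None:
        raise ValueError("entry does not look like a root cause analysis record")
    causes = []
    # An entry without a suspected root cause has no probability field,
    # so CAUSE_RE finds nothing and the list stays empty.
    for match in CAUSE_RE.finditer(entry):
        causes.append({
            "severity": match["severity"],
            "probability": float(match["probability"]),
            "time": match["time"],
            "device_address": match["address"],
            "device_name": match["name"],
            "event": match["event"],
            "elements": parse_pairs(match["elements"]),
        })
    return {
        "time": header["time"],
        "fault": header["fault"],
        "location": parse_pairs(header["location"]),
        "causes": causes,
    }
```

Sorting the returned causes by probability gives a simple way to review the most likely root cause first.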

AI for log aggregation and root cause analysis tasks at a glance

To configure AI for log aggregation and root cause analysis, perform the following tasks:

1.     Enabling AI for log aggregation and root cause analysis

2.     Saving the log aggregation summary file or root cause analysis file

Prerequisites

Enable the information center and configure local loopback address 127.0.0.1 as the log host address. UDP port 514 is used as the listening port and cannot be modified.

Enabling AI for log aggregation and root cause analysis

1.     Enter system view.

system-view

2.     Enter AI service view.

ai-service

3.     Enable AI for log aggregation and root cause analysis.

ai ai-fault-analysis enable

By default, AI for log aggregation and root cause analysis is disabled.

Saving the log aggregation summary file or root cause analysis file

To save the log aggregation summary file, root cause analysis file, or log file, execute the following command in any view:

ai ai-fault-analysis save { logfile | root-cause | summary }

By default, the system does not save the log aggregation summary file or root cause analysis file.

Display and maintenance commands for AI for log aggregation and root cause analysis

Execute display commands in any view.

 

Task | Command
Display information in the log aggregation summary file and root cause analysis file. | display ai ai-fault-analysis { logfile | root-cause | summary } [ tail line-number ]

 

 


Configuring AI for device anomaly detection

About AI for device anomaly detection

Overview

When managing and maintaining devices, you can manually configure alarm thresholds for different resources. For example, you can execute the resource-monitor resource command to configure alarm thresholds for CPU or ARP resources. The system will generate alarm messages when the specified thresholds are reached. However, these types of alarm messages cannot reflect changes in resource usage trends. Additionally, misconfiguring alarm thresholds can interfere with fault diagnosis and hinder the development of future AIOps methods.

This function uses AI algorithms to infer whether there are anomalies in the usage of various resources and table entries on the device. It generates alarm messages only after it determines that an anomaly exists. Because the thresholds adapt to the actual usage trend, this approach is more reliable than triggering alarms through static alarm thresholds.

Operating principle

This function supports monitoring the CPU usage and usage of various table resources. You can use the display resource-monitor command to view resource monitoring information.

The data collected varies based on the device model. This function uses a maximum of three different algorithms to dynamically detect anomalies in resource usage for the current performance metrics. The main operating mechanism includes:

1.     Data collection: The AI anomaly detection process collects resource usage data of different indicators on the device through NETCONF. Each collected data packet contains key information such as device information, timestamp, and resource usage of indicators.

2.     In the first round of AI algorithm detection, the AI anomaly detection process learns from historical indicators collected from the device to analyze their change trend and infer the dynamic threshold range model. The AI anomaly detection process can also directly load AI model files to generate the dynamic threshold range model for data indicators. The AI anomaly detection process detects real-time data from the network based on the model. If the collected data falls within the model's threshold range, there is no anomaly. Otherwise, the data is marked as abnormal. The AI anomaly detection process can automatically select the appropriate first-round algorithm based on the fluctuation and trend characteristics of different indicators. The blue region in Figure 1 indicates the dynamic threshold range for normal indicators.

Figure 1 Algorithm for the first round of anomaly detection

 

3.     In the second round of AI algorithm detection, indicators flagged as anomalies by the first round will undergo a second anomaly detection algorithm. The second round of AI algorithm detection is based on a voting anomaly detection mechanism using multiple unsupervised algorithms. Based on the statistical characteristics of the data, various AI algorithms are used for anomaly decision. Indicators identified as anomalies might require a third round of AI algorithm detection depending on the device and type of indicators.

4.     In the third round of AI algorithm detection, indicators still flagged as anomalies are checked by an unsupervised deep learning algorithm (DeepAR) that runs on the device chips.

5.     The AI anomaly detection process generates an alarm for abnormal usage only when all rounds of anomaly detection show abnormal results. Using multiple rounds of AI algorithm detection improves accuracy and intelligence of anomaly detection.
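
The multi-round mechanism above can be sketched in Python as follows. This is a simplified illustration, not the algorithms used on the device. This guide does not name the first-round and second-round algorithms, so the mean and standard deviation band, the z-score, interquartile range, and median absolute deviation checks, and all numeric thresholds are stand-ins chosen for illustration. The third round (DeepAR) is not implemented and is represented only by an optional callback. All function names are hypothetical.

```python
from statistics import mean, median, pstdev


def outside_dynamic_threshold(history, value, k=3.0):
    """Round 1 stand-in: is the value outside a band learned from history?"""
    mu, sigma = mean(history), pstdev(history)
    return abs(value - mu) > k * sigma


def vote_anomalous(history, value):
    """Round 2 stand-in: majority vote among three unsupervised checks."""
    mu, sigma = mean(history), pstdev(history)
    z_vote = sigma > 0 and abs(value - mu) / sigma > 3.0

    ordered = sorted(history)
    q1 = ordered[len(ordered) // 4]
    q3 = ordered[(3 * len(ordered)) // 4]
    iqr = q3 - q1
    iqr_vote = value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr

    med = median(history)
    mad = median([abs(x - med) for x in history])
    mad_vote = mad > 0 and abs(value - med) / mad > 3.5

    return sum([z_vote, iqr_vote, mad_vote]) >= 2


def should_alarm(history, value, third_round=None):
    """Alarm only when every round that runs reports an anomaly."""
    if not outside_dynamic_threshold(history, value):
        return False
    if not vote_anomalous(history, value):
        return False
    # Placeholder for the third round, which applies only to some
    # devices and indicator types.
    if third_round is not None and not third_round(history, value):
        return False
    return True


cpu_history = [40, 42, 41, 43, 39, 41, 40, 42, 41, 40]
print(should_alarm(cpu_history, 95))  # far outside the learned range
print(should_alarm(cpu_history, 41))  # within the learned range
```

Requiring agreement from every round trades some sensitivity for fewer false alarms, which is the design goal stated in step 5.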

AI for device anomaly detection tasks at a glance

To configure AI for device anomaly detection, perform the following tasks:

1.     Enabling AI for device anomaly detection

2.     Saving the log file for AI-based device anomaly detection

Enabling AI for device anomaly detection

1.     Enter system view.

system-view

2.     Enter AI service view.

ai-service

3.     Enable AI for device anomaly detection.

ai key-resource-monitor enable

By default, AI for device anomaly detection is disabled.

Saving the log file for AI-based device anomaly detection

About this task

The log file for AI-based device anomaly detection can help operations and technical support analyze the effectiveness of the AI-based device anomaly detection feature. The log for AI-based device anomaly detection contains usage information of various metrics collected and the basis for multiple rounds of AI algorithm decisions.

Procedure

Execute this command in any view to save the log file for the AI-based device anomaly detection feature.

ai key-resource-monitor save logfile

By default, the system does not automatically save the log file for the AI-based device anomaly detection feature.

Display and maintenance commands for AI-based device anomaly detection

Execute display commands in any view.

 

Task | Command
Display the running log for AI-based device anomaly detection. | display ai key-resource-monitor logfile [ tail line-number ]

 

 
