H3C Unified Platform Troubleshooting Guide-E07xx-5W103


 

H3C Unified Platform

Troubleshooting Guide

Document version: 5W103-20250811

 

Copyright © 2025 New H3C Technologies Co., Ltd. All rights reserved.

No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of New H3C Technologies Co., Ltd.

Except for the trademarks of New H3C Technologies Co., Ltd., any trademarks that may be mentioned in this document are the property of their respective owners.

The information in this document is subject to change without notice.


Contents

Introduction
    General guidelines
    Collecting failure running information
        Collecting running logs
        Collecting deployment/upgrade logs
        Collecting runtime data
    Contacting technical support
Troubleshooting cluster nodes
    Node hardware failure
    Unified Platform unavailable because of cluster node exceptions
    Containers get stuck in evicted state because of insufficient disk space
    K8s node gets stuck in Not Ready state after a hugepages size change
    Cluster network mode change failure
    Exception on kube-apiserver, kube-scheduler, or kube-controller-manager after Matrix upgrade
    Error occurrence on Pods including calico-node with error message Delegation not available for unit type
    The system node experiences prolonged power failure or exceptions, causing PostgreSQL data directories on other nodes to consume excessive disk space
    Some nodes become NotReady and many Pods become abnormal upon network restoration for the primary site in an RDRS environment
Troubleshooting blocked access to Matrix services
    The cluster rejects access to any Matrix service if the security policy drops all access requests
Troubleshooting Matrix login failure because of incorrect password
    Login failure occurs because of incorrect password
    Other users except admin fail to log in to Matrix when they enter incorrect passwords
Troubleshooting default route loss
    Default routes get lost after the ifconfig command restarts the NIC
Troubleshooting ETCD service failure
    ETCD service startup failure
    ETCD client request times out or primary/standby switchover occurs frequently in the ETCD cluster when ETCD does not have an exclusive use of a disk
Troubleshooting docker service failure
    No response after executing docker commands
    The page displays that the Docker component is abnormal
Troubleshooting failures caused by power cycle of the server or network disconnection
    Files in the operating system were lost after power cycle of the server
    Files that Matrix depends on were lost after power cycle of the server
    The node is in red or yellow or some Pods are in CreateContainerError state on the GUI of the Matrix platform after power cycle of the server
    The node is in red or yellow or some Pods are in Error state on the GUI of the Matrix platform after power cycle of the server
    After you power off and restart the server where the node resides, the Prometheus data file becomes corrupted, causing abnormal Pod status
    Power cycle of the node server results in unstable IPv6 NIC for the MACVLAN additional network
    Power cycle of the node server results in repeated restarts of the Pod that uses the additional network
Troubleshooting deployment failure
    Cluster deployment failed and a K8SINSTALL-ERROR occurred
    Unified Platform deployment failed because the kubectl exec command failed
    Cluster deployment failed in an IPv6 environment when a vNIC was added to a node or an IP address was added to a vNIC
Troubleshooting Unified Platform access failure
    Slow request processing caused by ETCD I/O latency
    Unified Platform login failure due to password loss
    Login anomalies caused by Ext4 file system damage
Troubleshooting GlusterFS anomalies
    GlusterFS deployment fails because the disk or disk partition it uses has residual data
    Failed to install components that use GlusterFS storage because storage volumes cannot be deleted
    Failed to upgrade components that use GlusterFS storage because glusterd exits
    GlusterFS service anomaly occurred after Matrix rebuilding by using the ISO image
    Unified Platform upgrade/uninstallation or remote disaster recovery failure caused by deadlocks in GlusterFS
    After you restart a node or the network changes, the mount directories of some pods that use the GlusterFS storage might have question marks (???) in the file content and the file content cannot be read or written
    Unified Platform upgrade/uninstallation or remote disaster recovery failure caused by deadlock in GlusterFS
    In the GlusterFS disaster recovery environment, the primary and backup clusters experience data synchronization issues due to the absence of scheduled tasks
Troubleshooting IP editing failure
    Failed to edit the VIPs
Troubleshooting image corruption
    Image corruption
    Image layer verification failed
Troubleshooting PXC failures
    Failed to start the PXC database
    Disk file corruption of PXC database
    PXC database in abnormal state
    Loss of all contents of PXC database startup file grastate.dat
    Power outage resulting in loss of all disk file content required to start the PXC database
Troubleshooting server restart issues
    Power outage resulting in ZooKeeper data sync failure between three nodes
    Server power outage resulting in damage of the XFS file and Vertica database
    Kafka startup failure due to ZooKeeper data sync issue upon restart
    Forcible power shutdown resulting in damage to the operating system's XFS partition upon server restart and system anomalies
Troubleshooting Unified Platform upgrade failures
    After you upgrade Unified Platform, the k-eureka pod and others are abnormal and in Pending state
Troubleshooting RDRS
    If you delete the RDRS on the primary site after the primary and backup sites are disconnected, residual configurations exist on the backup site
    During the process of primary/backup switchover, a network issue between the primary and backup sites caused the new backup site to become inaccessible
    The primary site cannot automatically switch to the backup role though the backup site successfully takes over in automatic switchover mode
    Components in primary state on both the primary and backup sites and service exceptions
    Restart of some nodes in primary/backup sites, and PXC data synchronization anomaly
Troubleshooting Kafka anomalies
    Service anomalies when the Kafka instance is in Running state
    Abnormal Kafka instance status upon system power outage or restart
    Unexpected power outage resulting in Kafka data loss
Troubleshooting Redis
    itom-central-redis cluster data synchronization anomaly


Introduction

This document provides information about troubleshooting common software and hardware issues with Unified Platform.

General guidelines

To help identify the cause of issues, collect system and configuration information, including:

·     Unified Platform version, Matrix version, and operating system version.

·     Symptom, time of failure, and configuration.

·     Network topology information, including network diagram, port connections, and points of failure.

·     Log messages and diagnostic information. For more information about collecting this information, see "Collecting failure running information."

·     Steps you have taken and the effect.

Collecting failure running information

Perform the following tasks to collect the running information of Unified Platform.

Collecting running logs

1.     Enter the GUI login address of Unified Platform (in the format of http://ip_address:30000/central/index.html) in your browser, and then press Enter.

2.     Enter the username and password, and then click Login to enter the GUI home page of Unified Platform.

3.     Click System > Log Management > Running Logs to enter the Running Logs page. Then, select Global Logs or a node-specific log and perform the following tasks:

¡     To view logs in the specific directory in a time span, specify the directory and the start time and end time of the time span.

¡     Search for logs of a specific module by using the file or directory name. For example:

-     To search for alarm logs, enter keyword itom-alarm.

-     To search for health check logs, enter keyword occ.

-     To search for backup and restoration logs, enter keyword backup_recovery.

-     To search for dashboard logs, enter keyword dashboard.

-     To search for resource permission logs, enter keyword k-ures, k-permission, or k-framework.

¡     To export running logs to the local device, select specific or all log files, and then click Export.

Figure 1 Running Logs

 

Collecting deployment/upgrade logs

Exporting logs on the GUI

1.     Enter the Matrix login URL (in the format of https://ip_address:8443/matrix/ui) in the address bar of the browser. Enter the username and password, and click Login.

2.     Click the ... icon in the upper right corner of the page, and then click Export Log. In the confirmation dialog box that opens, click OK.

Figure 2 Clicking Export Log

 

Figure 3 Clicking OK

 

3.     After you export the logs, you can view the exported logs on the download page of the browser.

Exporting logs from the backend

If the exported log file is too large, you can export the deployment/upgrade logs from the backend instead. To do so, use a file transfer tool such as WinSCP or MobaXterm to download the log file matrix.log from the /var/log/matrix-diag/Matrix/Matrix/ directory on each of the three nodes to your local host.
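If you prefer the command line to a GUI file transfer tool, the following sketch shows the same download with scp. It assumes SSH access from a Linux host and uses a placeholder node IP address that you must replace with each node's actual address.

# Run from your local host once per node (replace <node-ip>)
scp root@<node-ip>:/var/log/matrix-diag/Matrix/Matrix/matrix.log ./matrix-<node-ip>.log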

Collecting runtime data

Use the backend script /opt/matrix/tools/matrix_log_collection.sh to collect real-time operational data. When network or traffic-related issues occur, you must execute this script on all nodes to collect data. Because log collection will occupy some disk space, make sure enough free space is available on your disk before you execute the script.

1.     Log in to the backend of each node and execute the sudo bash /opt/matrix/tools/matrix_log_collection.sh  command. During the script execution process, you need to enter Y multiple times to confirm the operation.

Figure 4 Executing the script

 

2.     After you execute the script, a compressed file named matrix-timestamp.tar.gz will be created in the /home/matrix-log-collect directory. Export this file for troubleshooting.

Figure 5 Checking the /home/matrix-log-collect directory

 

Contacting technical support

If an issue persists after you perform the troubleshooting procedures in this document, contact H3C Support. When you contact an authorized H3C support representative, be prepared to provide information described in "General guidelines." This information will help the support engineer assist you as quickly as possible.

You can contact H3C Support at [email protected].

 


Troubleshooting cluster nodes

Node hardware failure

Symptom

A node in the cluster fails and needs to be replaced.

Solution

If a node in the cluster fails because of a hardware error and cannot be recovered, replace it with a new server.

1.     Configure the replacement node server to have the exact same hostname, network card name, node IP address, username, password, RAID mode, and disk partition setting as the original faulty node.

2.     Install on the replacement node server the same version of Matrix software as the cluster nodes. For more information, see H3C Unified Platform Deployment Guide.

3.     Log in to the Matrix platform. Click Deploy on the top navigation bar and then select Cluster from the navigation pane. Then, click the set icon  for the target node, and select Repair to repair the node.

4.     Verify that the target node is normal after it is rebuilt.

5.     If the issue persists, contact the support.

Unified Platform unavailable because of cluster node exceptions

Symptom

The following issues are all present:

·     Failed to log in to Unified Platform.

·     On the Matrix platform, a master node is in red and you cannot ping the abnormal node.

·     Pods in running state reside on the abnormal node.

·     Execute the kubectl get endpoints -nservice-software itom-central-login-svc command to identify the endpoints for the itom-central-login service. If the Pod IP addresses on the abnormal node remain on these endpoints, an exception occurs.

Figure 6 Identifying endpoints for the itom-central-login service

 

Solution

1.     Access the CLI of the abnormal node and execute the kubectl drain nodeName --ignore-daemonsets --force --delete-local-data --timeout=1800s command to remove all Pods from the abnormal node.

The nodeName argument represents the name of the abnormal node, which can be obtained by using the hostname command.

2.     Execute the kubectl delete node nodeName command to remove the abnormal node.

3.     Repair the nodes that are abnormally disconnected. If the server hardware failure cannot be recovered, replace the server.

4.     Log in to the Matrix platform. Click Deploy on the top navigation bar and then select Cluster from the navigation pane. Then, click the set icon  for the target node, and select Rebuild to rebuild the node.

5.     If the issue persists, contact the support.

Containers get stuck in evicted state because of insufficient disk space

Symptom

When the disk space on a node is full, the kubectl get pods --all-namespaces command shows that a large number of containers on the node are in evicted state. These containers stay in that state even if disk space is manually released.

Solution

This symptom might occur if the disk space is insufficient on a node. In this case, the K8s mechanism generates a large number of containers in evicted state.

To resolve the issue:

1.     Release disk space in the root partition to reduce disk usage.

For example, manually delete the compressed log packages from the /var/log directory and old installation packages from the /opt/matrix/app/install/packages/ directory. For a command sketch, see the example after this procedure.

2.     Log in to the Matrix platform. Click Deploy on the top navigation bar and then select Cluster from the navigation pane. Then, click the set icon  for the target node, and select Repair to repair the node. The K8s mechanism will automatically delete containers in evicted state.

3.     If the issue persists, contact the support.
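The following commands illustrate step 1. They assume a root shell on the affected node, and the file patterns are examples only, so review each file before deleting it.

# Check root partition usage
[root@node1 ~]# df -h /
# List and remove compressed log packages (example pattern)
[root@node1 ~]# ls /var/log/*.tar.gz
[root@node1 ~]# rm -f /var/log/<old-log-package>.tar.gz
# Remove old installation packages that are no longer needed
[root@node1 ~]# ls /opt/matrix/app/install/packages/
[root@node1 ~]# rm -f /opt/matrix/app/install/packages/<old-package-file>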

K8s node gets stuck in Not Ready state after a hugepages size change

Symptom

After you change the hugepages size, for example, changing the value of the GRUB_CMDLINE_LINUX parameter in file /etc/default/grub from "crashkernel=auto rhgb quiet default_hugepagesz=2M hugepagesz=2M hugepages=8192" to "crashkernel=auto rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=16", the K8s node gets stuck in Not Ready state even after the system restarts.

Solution

Kubelet does not support huge pages of different sizes. When you change the size of huge pages, a new hugepages size directory is created in /sys/kernel/mm/hugepages/, and the previous hugepages size directory is still retained. This can cause K8s node state synchronization to fail and therefore cause the K8s node to get stuck in Not Ready state.
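To confirm that two hugepages size directories coexist on the node, you can list the directory from the CLI. The directory names below are the standard kernel names for 2 MB and 1 GB pages; the actual output depends on your system.

[root@node1 ~]# ls /sys/kernel/mm/hugepages/
hugepages-1048576kB  hugepages-2048kB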

This issue can be resolved by setting the number of huge pages of one size to 0. This section sets the number of 2M huge pages to 0.

To resolve the issue:

1.     Edit the hugepages configuration file.

a.     Use a vi editor to open the configuration file.

[root@node1 ~]# vi /etc/default/grub

b.     Press i to enter edit mode and configure settings in the file as follows:

GRUB_TIMEOUT=5

GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"

GRUB_DEFAULT=saved

GRUB_DISABLE_SUBMENU=true

GRUB_TERMINAL_OUTPUT="console"

GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=0"

GRUB_DISABLE_RECOVERY="true"

c.     Press ESC to exit edit mode.

d.     Enter :wq and press Enter to save the changes.

2.     Save configuration changes and restart the server for the changes to take effect.

¡     If the server started up in UEFI mode, perform the following tasks:

[root@node1 ~]# grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg

Generating grub configuration file ...

Found linux image: /boot/vmlinuz-3.10.0-862.el7.x86_64

Found initrd image: /boot/initramfs-3.10.0-862.el7.x86_64.img

Found linux image: /boot/vmlinuz-0-rescue-f2e062c5077847ae837b2f1cdb91104f

Found initrd image: /boot/initramfs-0-rescue-f2e062c5077847ae837b2f1cdb91104f.img

Done

[root@node1 ~]# reboot

¡     If the server started up in Legacy mode, perform the following tasks:

[root@node1 ~]# grub2-mkconfig -o /boot/grub2/grub.cfg

Generating grub configuration file ...

Found linux image: /boot/vmlinuz-3.10.0-862.el7.x86_64

Found initrd image: /boot/initramfs-3.10.0-862.el7.x86_64.img

Found linux image: /boot/vmlinuz-0-rescue-f2e062c5077847ae837b2f1cdb91104f

Found initrd image: /boot/initramfs-0-rescue-f2e062c5077847ae837b2f1cdb91104f.img

Done

[root@node1 ~]# reboot

3.     Verify that the number of 2M huge pages has been changed to 0.

[root@node1 ~]# cat /proc/cmdline

BOOT_IMAGE=/vmlinuz-3.10.0-862.el7.x86_64 root=UUID=f47e3128-e888-499e-b370-2b381b6f3134 ro crashkernel=auto rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=0

4.     If the issue persists, contact the support.

Cluster network mode change failure

Symptom

The network mode change for the cluster fails from the Matrix Web interface because of ETCD service exceptions on the master node.

Solution

The system initiates two ETCD requests for a network mode change, one for the change in calico and the other for the change on the Web interface. The Web interface prompts mode change failure if either request fails.

To resolve the issue:

1.     Verify that the ETCD service has recovered. If the service has not recovered, contact the support.

[root@name1 1.0.0]# export ETCDCTL_API=2 && etcdctl cluster-health

member fb58b3b32bac01c is healthy: got healthy result from http://matrix-node1:2379

member aa6e53b313aa741f is healthy: got healthy result from http://matrix-node2:2379

member d1fcbe1f6db25390 is healthy: got healthy result from http://matrix-node3:2379

2.     If the network mode on the Web interface has been changed but the system prompts failure, change the network mode back to the previous mode. Make sure the mode has been changed back on the Web interface and then proceed to the next step.

3.     Try again to set the target network mode.

Exception on kube-apiserver, kube-scheduler, or kube-controller-manager after Matrix upgrade

Symptom

A node is in red after Matrix upgrade and the node details indicate that a kube-apiserver, kubeScheduler, or kubeControllerManager exception has occurred. After you access the CLI of the abnormal node and execute the kubectl get pod -A -owide command, the command output shows that the abnormal node has Pods in CrashLoopBackOff state.

Solution

The solution for the failure varies by situation.

Situation 1

·     Symptom:

When you execute netstat -anlp | grep -w 6443, netstat -anlp | grep -w 10251, or netstat -anlp | grep -w 10252 on the node where the abnormal Pods reside, the corresponding service ports are occupied and in LISTEN state.

·     Cause analysis:

If the old processes do not exit after Matrix upgrade and kube-apiserver port 6443, kube-scheduler port 10251, or kube-controller-manager port 10252 is not released, new Pods cannot start up. You can access the CLI of the node and execute kubectl logs -n kube-system $pod_name or docker logs $container_id to obtain log information about the occupied ports.

·     Solution:

The troubleshooting procedures are similar for the kube-scheduler, kube-apiserver, and kube-controller-manager services. This example troubleshoots the kube-scheduler service as follows:

a.     Remove the kube-scheduler Pod.

[root@name ~]# mv /etc/kubernetes/manifests/kube-scheduler.yaml /opt/

b.     Verify that all kube-scheduler containers have exited. If a container does not exit after a long while, execute the docker rm -f $container_id command to forcibly remove the container or execute the systemctl restart docker command to restart the docker service.

[root@name ~]# docker ps | grep kube-scheduler

c.     Execute the netstat -anlp | grep -w 10251 command to verify that the port has been released. The port is released if no connection in LISTEN state is displayed in the command output.

d.     Start up the kube-scheduler Pod.

[root@name ~]# mv /opt/kube-scheduler.yaml /etc/kubernetes/manifests/

e.     Execute the kubectl get pod -n kube-system -o wide command to obtain the Pod status.

f.     If the issue persists, contact the support.

Situation 2

·     Symptom:

When you execute netstat -anlp | grep -w 6443, netstat -anlp | grep -w 10251, or netstat -anlp | grep -w 10252 on the node where the abnormal Pods reside, the corresponding ports are occupied and only connections in TIME_WAIT state exist. In addition, the ports are not occupied by the kube-apiserver, kube-scheduler, or kube-controller-manager process.

·     Cause analysis:

If the kube-apiserver, kube-scheduler, or kube-controller-manager Pod restarts during Matrix upgrade, port 6443, 10251, or 10252 might be occupied by GlusterFS. In this case, a Pod exception occurs.

·     Solution:

Contact the support.

Error occurrence on Pods including calico-node with error message Delegation not available for unit type

Symptom

After you edit the IP address of a node on the Matrix platform, the node is in red. After you access the CLI of the abnormal node and execute the kubectl get pod -A -owide command, the command output shows that some Pods, including calico-node and calico-kube-controller, are abnormal.

The kubelet logs are as follows:

Error syncing pod 991e112f-c3a3-4c46-9a9b-dfde4ca0a27b ("calico-node-vlpz8_kube-system(991e112f-c3a3-4c46-9a9b-dfde4ca0a27b)"), skipping: failed to ensure that the pod: 991e112f-c3a3-4c46-9a9b-dfde4ca0a27b cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod991e112f-c3a3-4c46-9a9b-dfde4ca0a27b] : Delegation not available for unit type

Solution

This symptom might occur if the containerd version is lower than containerd-v1.3.0. To obtain the containerd version, execute the containerd -v command from the CLI of the node.

To resolve the issue, execute the systemctl restart kubelet.service command on the node where an abnormal Pod resides to restart the kubelet service.
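A minimal command sketch for this check and fix, run on the node that hosts the abnormal Pod, is as follows:

# Check the containerd version (the issue applies to versions earlier than v1.3.0)
[root@node1 ~]# containerd -v
# Restart the kubelet service
[root@node1 ~]# systemctl restart kubelet.service
# Verify that the calico Pods recover
[root@node1 ~]# kubectl get pod -A -owide | grep calico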

The system node experiences prolonged power failure or exceptions, causing PostgreSQL data directories on other nodes to consume excessive disk space

Solution

Cause

In a PostgreSQL database cluster, the standby databases continuously synchronize data from the primary database. The synchronized data depends on the WAL logs in the primary database. To ensure proper data synchronization for all standby database Pods, the primary database reserves the WAL logs not yet synchronized by standby databases even though it has automatic WAL log cleanup enabled. If a standby database node remains inactive for a long time while the PostgreSQL database keeps performing insert, delete, or update operations, the WAL log directory on the primary node will keep consuming more disk space. As shown in the following figure, the WAL log directory size increased from 97 MB to 11 GB.

Solution

The primary PostgreSQL database reserves WAL logs to ensure that standby databases can synchronize data and operate normally. To resolve disk space issues, start the stopped node and make sure its PostgreSQL instance Pod runs properly. As the standby Pod continuously synchronizes data, the primary database automatically clears WAL logs, gradually freeing up disk space. As shown in the following figure, the disk space occupied by WAL logs gradually decreases from 11 GB to 657 MB after the down node restarts.
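To observe the recovery, you can watch the Pod status and the size of the WAL log directory on the primary node. The sketch below is only an illustration: the Pod name filter and the data directory path are placeholders that vary by deployment, so adjust them to your environment.

# Verify that the PostgreSQL instance Pods on the recovered node are Running (name filter is a placeholder)
[root@node1 ~]# kubectl get pod -A -owide | grep -i postgres
# Watch the WAL log directory size decrease on the primary node (path is a placeholder)
[root@node1 ~]# du -sh <postgresql-data-directory>/pg_wal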

 

Some nodes become NotReady and many Pods become abnormal upon network restoration for the primary site in an RDRS environment

Symptom

If you disconnect the network for the primary site of the RDRS environment, wait for a period of time, and then restore the network connection, some nodes at the primary site remain in NotReady state and a large number of Pods become abnormal. The issue does not recover on its own.

Solution

The Docker log of the abnormal nodes contains a large number of lock requests and waits. After network restoration, process and Pod interlocks occur during the terminating and starting of Pods by Docker at the primary site, resulting in abnormal node state. This issue occurs with an extremely low probability.

To resolve the issue, execute the systemctl restart docker.service command at the backend of abnormal nodes to restart the Docker service.
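A minimal sketch of the recovery on each abnormal node follows. The journalctl filter for inspecting the Docker log is illustrative only and assumes Docker logs to the system journal.

# Optional: inspect the Docker log for lock requests and waits (filter is illustrative)
[root@node1 ~]# journalctl -u docker --since "1 hour ago" | grep -i lock | tail
# Restart the Docker service
[root@node1 ~]# systemctl restart docker.service
# Verify that the nodes return to Ready state
[root@node1 ~]# kubectl get nodes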

 


Troubleshooting blocked access to Matrix services

The cluster rejects access to any Matrix service if the security policy drops all access requests

Symptom

If you enable a security policy with the default action set to DROP and all rules removed, the cluster rejects all access requests to any Matrix services.

Solution

This symptom occurs because the default action is set to DROP and the default rule that permits access to Matrix port 8443 is removed.

To resolve the issue:

1.     Log in to a master node from the management port.

2.     Access the disaster recovery directory.

[root@node1 ~]# cd /opt/matrix/k8s/disaster-recovery/

3.     Restore the default security policy settings.

[root@node1 ~]# bash recover-security-policies.sh

4.     If the issue persists, contact the support.
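After you restore the default security policy, you can verify from any host that Matrix port 8443 accepts requests again. The check below assumes curl is available and uses a placeholder node IP address.

# An HTTP status code (for example, 200 or 302) indicates that access to port 8443 is no longer dropped
[root@node1 ~]# curl -k -s -o /dev/null -w "%{http_code}\n" https://<node-ip>:8443/matrix/ui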

 


Troubleshooting Matrix login failure because of incorrect password

Login failure occurs because of incorrect password

Symptom

An admin user fails to log in to Matrix because of incorrect password.

Solution

To resolve the issue:

1.     Access the script directory on a master node and execute the bash script_name reset_password command to run the password reset script. The script_name argument represents the script name and the reset_password argument represents the new password. The script name varies by matrix operating status. In this example, the new password is Pwd@123456.

¡     If the matrix operates correctly, run the password reset script named resetMatrixUserPassword.sh.

[root@node1 ~]# cd /opt/matrix/k8s

[root@node1 k8s]# bash resetMatrixUserPassword.sh Pwd@123456

+ MATRIX_ADMIN_PASSWD=Pwd@123456

++ curl -k -g -X POST -H Content-Type:application/json -d '{"password": "Pwd@123456"}' https://localhost:8443/matrix/rsapi/v1.0/usermanage/reset_password

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current

                                 Dload  Upload   Total   Spent    Left  Speed

100   110    0    84  100    26    251     77 --:--:-- --:--:-- --:--:--   252

+ return_info='{"token":"3ac4fd9b-35d7-4f66-97b0-2b4ef0a368d1","username":"admin","expireTime":600}'

+ [[ {"token":"3ac4fd9b-35d7-4f66-97b0-2b4ef0a368d1","username":"admin","expireTime":600} =~ admin ]]

+ echo 'Password reset succeeded.'

Password reset succeeded.

¡     If the matrix cannot operate correctly, run the emergency password reset script named resetMatrixUserPassword_emergency.sh.

[root@node1 ~]# cd /opt/matrix/k8s

[root@node1 k8s]# bash resetMatrixUserPassword_emergency.sh Pwd@123456

+ MATRIX_ADMIN_PASSWD=Pwd@123456

++ curl -k -g -X POST -H Content-Type:application/json -H X-Access-Mode:emergency -d '{"password": "Pwd@123456"}' https://localhost:8443/matrix/rsapi/v1.0/usermanage/reset_password

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current

                                 Dload  Upload   Total   Spent    Left  Speed

100   110    0    84  100    26    253     78 --:--:-- --:--:-- --:--:--   253

+ return_info='{"token":"d90753f5-cd2c-4c1c-b178-45cdb18c6261","username":"admin","expireTime":600}'

+ [[ {"token":"d90753f5-cd2c-4c1c-b178-45cdb18c6261","username":"admin","expireTime":600} =~ admin ]]

+ echo 'Password reset succeeded.'

Password reset succeeded.

2.     Log in to Matrix again by using the new password.

3.     If the issue persists, contact the support.

Other users except admin fail to log in to Matrix when they enter incorrect passwords

Symptom

On the Matrix login page, the non-admin users will fail to log in due to forgotten passwords or other reasons.

Solution

Execute the corresponding script based on your cluster conditions to reset the password.

·     Reset the password when the cluster runs normally

a.     Access the script directory of a master node. Execute the script by using the bash resetMatrixUserPassword.sh username  reset_password command, where resetMatrixUserPassword.sh is the script name, username is the username, and reset_password is the new password. For example: bash resetMatrixUserPassword.sh test Pwd@123456.

[root@node1 ~]# cd /opt/matrix/k8s

[root@name0 k8s]# bash resetMatrixUserPassword.sh test Pwd@12345

Password reset to Pwd@12345 for user test succeeded.

b.     After the script is executed, log in to Matrix with the new password.

·     Reset the password in cluster emergency mode

a.     First, reset the admin password and repair the cluster by following the steps in "Login failure occurs because of incorrect password."

b.     After the cluster recovers, reset the user password by using the method for a normally running cluster described earlier in this section.

 


Troubleshooting default route loss

Default routes get lost after the ifconfig command restarts the NIC

Symptom

The configured default routes get lost after you execute the ifconfig command on a node in the cluster to restart the NIC of the node.

Solution

To resolve the issue:

1.     Access the CLI of the node and then use the systemctl restart network command to restart the network service.

[root@node01 ~]# systemctl restart network

2.     Use the route -n command to verify that the default routes have been restored. The example output from this command is as follows:

[root@node01 ~]# route -n

Kernel IP routing table

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface

0.0.0.0         10.99.212.1     0.0.0.0         UG    0      0        0 eth0

10.99.212.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0

169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0

192.168.122.0   0.0.0.0         255.255.255.0   U     0      0        0 virbr0

3.     If the issue persists, contact the support.

 


Troubleshooting ETCD service failure

ETCD service startup failure

Symptom

ETCD service startup failure occurs because of ETCD storage data file corruption or file loss when a node is power cycled.

·     After a node is power cycled, the ETCD service on it fails to start up because of database file corruption. This event eventually causes the cluster to malfunction.

The following information is displayed in the etcd.log file in the /var/log/matrix-diag/Matrix/etcd/ directory:

panic: freepages: failed to get all reachable pages (page 1407374894039040: out of bounds: 1264)

goroutine 116 [running]:

panic(0x55a1d6cce4a0, 0xc420202ef0)

        /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/panic.go:551 +0x3c5 fp=0xc42006bf60 sp=0xc42006bec0 pc=0x55a1d5f0ae25

github.com/coreos/bbolt.(*DB).freepages.func2(0xc42020c180)

...

·     Data restoration failure causes file corruption.

Typically, the snap log file in the /var/lib/etcd/default.etcd/member/ directory must contain at least one log index value greater than the smallest log index value in the wal log file.

If all log index values in the wal log file are greater than those in the snap log file (as shown in Figure 7), the key operation log data will get lost and the ETCD service cannot recover data. In this case, file corruption occurs.

Figure 7 Viewing log index values in snap and wal log files

 

·     After a node is power cycled, the ETCD service on it fails to start up because of database snapshot file loss. This event eventually causes the cluster to malfunction.

The following log is displayed:

etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist

·     After a node is power cycled, the ETCD service on it fails to start up because of data file corruption. This event eventually causes abnormal node status.

The following log is displayed on the abnormal node:

"error":"walpb: crc mismatch"

Solution

To resolve the issue:

1.     Log in to each node, and execute the systemctl status etcd command to verify that the ETCD service is in running state.

2.     If the database file for the ETCD service of only one node is corrupted in standalone mode, use the procedure in "Solution for a standalone node."

3.     If the database file for the ETCD service of only one node is corrupted in cluster mode, perform the following steps:

a.     Log in to Matrix.

b.     Click the  icon in the upper right corner of the target node on the Deploy > Clusters > Deploy Cluster page, and then select Rebuild to rebuild the node.

4.     If the database file for the ETCD service is corrupted on two nodes in cluster mode, Matrix enters emergency mode. You can resolve the issue by rebuilding the nodes one after another.

5.     If the database file for the ETCD service is corrupted on all three nodes in cluster mode, use the procedure in "Solution for a three-node cluster."

Solution for a standalone node

1.     Log in to the node. Then, execute the systemctl status etcd command to view the ETCD service status. If the ETCD status is not running, go to the next step.

[root@master1 ~]# systemctl status etcd

2.     Stop the Matrix service on the node.

¡     For a root account, execute the systemctl stop matrix command.

[root@master1 ~]# systemctl stop matrix

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl stop matrix" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop matrix"

3.     Execute the mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix command to stop the kube-apiserver service.

[root@master1 ~]# mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix

4.     Completely stop the ETCD service and remove the ETCD data directory.

¡     For a root account, execute the systemctl stop etcd command to completely stop the ETCD service and then execute the rm -rf /var/lib/etcd/default.etcd/ command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.

[root@master1 ~]# systemctl stop etcd

[root@master1 ~]# rm -rf /var/lib/etcd/default.etcd/

[root@master1 ~]# ll /var/lib/etcd/

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl stop etcd" command to completely stop the ETCD service and then execute the sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/" command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop etcd"

[admin@node4 ~]$ sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/"

[admin@node4 ~]$ ll /var/lib/etcd/

5.     Access the ETCD backup directory /opt/matrix/backup/etcd_backup_snapshot/ to find the latest backup data file, such as Etcd_Snapshot_V900R001B06D012_20210805091547.db.

[root@master1 ~]# ll /opt/matrix/backup/etcd_backup_snapshot/

6.     Navigate to the ETCD restoration script directory and execute the restoration operation. The Etcd_Snapshot_*_*.db file specified in the restoration script is the latest backup data file found in step 5.

7.     Access the directory where the ETCD service restoration script resides.

[root@master1 ~]# cd /opt/matrix/k8s/disaster-recovery/

8.     Execute the ETCD service restoration script with a new backup file from the specified ETCD backup directory /opt/matrix/backup/etcd_backup_snapshot/. If the backup file does not exist in the backup directory, an error will be reported.

¡     For a root account, execute the following commands:

[root@master1 ~]# bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805091547.db

2021-08-06 03:16:19.500144 I | mvcc: restore compact to 109069

2021-08-06 03:16:19.506086 I | etcdserver/membership: added member 91651d28c8465c86 [http://10.99.212.125:2380] to cluster db6c09f0e7b9702b

¡     For a non-root account, execute the following commands:

[admin@node4 ~]$ sudo bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805091547.db

2021-08-06 03:16:19.500144 I | mvcc: restore compact to 109069

2021-08-06 03:16:19.506086 I | etcdserver/membership: added member 91651d28c8465c86 [http://10.99.212.125:2380] to cluster db6c09f0e7b9702b

9.     Restart the ETCD service.

¡     For a root account, execute the systemctl restart etcd command.

[root@master1 ~]# systemctl restart etcd

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl restart etcd" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart etcd"

10.     Restart the Matrix service.

¡     For a root account, execute the systemctl restart matrix command.

[root@master1 ~]# systemctl restart matrix

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl restart matrix" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart matrix"

11.     Restore the kube-apiserver service.

[root@master1 ~]# mv /opt/matrix/kube-apiserver.yaml /etc/kubernetes/manifests/

12.     Verify that the issue has been resolved.

a.     Use the VIP to log in to the GUI of the Matrix platform.

b.     On the top navigation bar, click Deploy, and then select Cluster from the left navigation pane to view the master node status. The master node is in normal status, as shown in Figure 8.

Figure 8 One master node in normal status

 

13.     Click the OBSERVE tab, and then select Workload from the menu to view the running status of the Pods. All Pods are in running status, as shown in Figure 9.

Figure 9 All pods on the Pod tab in running status

 

Solution for a three-node cluster

1.     Log in to each master node. Then, execute the systemctl status etcd command to view the ETCD service status. If the ETCD status is not running, go to the next step.

[root@master2 ~]# systemctl status etcd

2.     Stop the Matrix service on each master node.

¡     For a root account, execute the systemctl stop matrix command.

[root@master2 ~]# systemctl stop matrix

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl stop matrix" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop matrix"

3.     Execute the mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix command to stop the kube-apiserver service on each master node.

[root@master2 ~]# mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix

4.     Stop the ETCD service and remove the ETCD data directory on each master node.

¡     For a root account, execute the systemctl stop etcd command to completely stop the ETCD service and then execute the rm -rf /var/lib/etcd/default.etcd/ command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.

[root@master2 ~]# systemctl stop etcd

[root@master2 ~]# rm -rf /var/lib/etcd/default.etcd/

[root@master2 ~]# ll /var/lib/etcd/

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl stop etcd" command to completely stop the ETCD service and then execute the sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/" command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop etcd"

[admin@node4 ~]$ sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/"

[admin@node4 ~]$ ll /var/lib/etcd/

5.     Access the ETCD backup directory /opt/matrix/backup/etcd_backup_snapshot/ to find the latest backup data file, such as Etcd_Snapshot_V900R001B06D012_20210805091653.db. Make sure all nodes have the same backup data file. If a node lacks the file, you can copy the ETCD backup file from another node to it.

[root@master1 ~]# ll /opt/matrix/backup/etcd_backup_snapshot/

6.     Execute the ETCD service restoration script with a new backup file from the specified ETCD backup directory /opt/matrix/backup/etcd_backup_snapshot/. If the backup file does not exist in the backup directory, an error will be reported.

For data consistency, make sure all nodes use the same backup file. If a node does not have the file, copy one from another node.

7.     Access the directory where the ETCD service restoration script resides.

[root@master1 ~]# cd /opt/matrix/k8s/disaster-recovery/

¡     For a root account, execute the following commands:

[root@master2 ~]# bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805091653.db

2021-08-06 06:33:14.788657 I | mvcc: restore compact to 273930

2021-08-06 06:33:14.802137 I | etcdserver/membership: added member 312131d4535cc53f [http://10.99.212.124:2380] to cluster cd6d5adc1bfd16f5

2021-08-06 06:33:14.802189 I | etcdserver/membership: added member 5fc2f82d74297956 [http://10.99.212.123:2380] to cluster cd6d5adc1bfd16f5

2021-08-06 06:33:14.802206 I | etcdserver/membership: added member ad12c65048f444bd [http://10.99.212.120:2380] to cluster cd6d5adc1bfd16f5

¡     For a non-root account, execute the following commands:

[admin@node4 ~]$ sudo bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805014548.db

2021-08-06 01:22:10.876952 I | mvcc: restore compact to 12660679

2021-08-06 01:22:10.906116 I | etcdserver/membership: added member ac2cefc4cae84e25 [http://[2000::100:2000]:2380] to cluster ced7b5d5ee633b40

2021-08-06 01:22:10.906174 I | etcdserver/membership: added member b4689a44b8c1f191 [http://[2000::100:2001]:2380] to cluster ced7b5d5ee633b40

2021-08-06 01:22:10.906197 I | etcdserver/membership: added member c328a554c1ca84f4 [http://[2000::100:2002]:2380] to cluster ced7b5d5ee633b40

8.     Restart the ETCD service on each master node.

¡     For a root account, execute the systemctl restart etcd command.

[root@master2 ~]# systemctl restart etcd

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl restart etcd" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart etcd"

9.     Restart the Matrix service on each master node.

¡     For a root account, execute the systemctl restart matrix command.

[root@master2 ~]# systemctl restart matrix

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl restart matrix" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart matrix"

10.     Restore the kube-apiserver service on each master node.

[root@master2 ~]# mv /opt/matrix/kube-apiserver.yaml /etc/kubernetes/manifests/

11.     Verify that the issue has been resolved.

a.     Use the VIP to log in to the GUI of the Matrix platform.

b.     On the top navigation bar, click Deploy, and then select Cluster from the left navigation pane to view the master node status. The master nodes are in normal status, as shown in Figure 10.

Figure 10 Three master nodes in normal status

 

12.     Click the OBSERVE tab, and then select Workload from the menu to view the running status of the Pods. All Pods are in running status, as shown in Figure 11.

Figure 11 All pods on the Pod tab in running status

 

13.     If the issue persists, contact the support.

ETCD client request times out or primary/standby switchover occurs frequently in the ETCD cluster when ETCD does not have an exclusive use of a disk

Symptom

·     An ETCD client, such as a K8s or Matrix client, takes more than 800 ms to access the ETCD database. The following information is displayed in the etcd.log file in the /var/log/matrix-diag/Matrix/etcd directory on each master node.

2020-11-15 12:36:42.013987 W | etcdserver: read-only range request "key:\"/registry/services/specs/default/kubernetes\" " with result "range_response_count:1 size:295" took too long (877.352309ms) to execute

2020-11-15 12:36:54.026221 W | etcdserver: read-only range request "key:\"/registry/pods/base-service/\" range_end:\"/registry/pods/base-service0\" " with result "range_response_count:42 size:107232" took too long (1.767232614s) to execute

·     Primary/standby switchover occurs frequently in the ETCD cluster, which might be caused by a heartbeat timeout.

Solution

This issue typically results from disk I/O performance degradation in a shared storage environment.
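Before you apply the steps below, you can gauge how often slow requests occur by searching the ETCD log on each master node for the "took too long" warnings shown in the symptom. This is a read-only check.

[root@master1 ~]# grep "took too long" /var/log/matrix-diag/Matrix/etcd/etcd.log | tail -n 5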

To resolve the issue:

1.     If the symptom occurs during execution of an application installation, upgrade, or configuration deployment task and results in task execution failure, perform the task again to fix the issue.

The system does not remove data synchronized during the execution of an installation, upgrade, or configuration deployment task. Task re-execution is likely to succeed because the amount of data to be synchronized decreases, causing less impact on disk I/O performance.

2.     If the symptom occurs during normal operation of the system, edit the matrixLeaderLeaseDuration (lease aging time) and matrixLeaderRetryPeriod (lease detection interval) parameters to delay the timeout for primary/backup switchovers.

However, this change will increase the failover time.

3.     If writing failure or data loss occurs because of poor disk I/O performance, use one of the following methods to fix the issue:

¡     Method 1: If a Pod has a status or communication anomaly, use the kubectl delete pod -n namespace podName command to remove the Pod. The system will create a Pod automatically to restore the ETCD data source.

¡     Method 2: Use the procedure in "Solution for a standalone node" or "Solution for a three-node cluster."

¡     Method 3:

-     Uninstall Matrix from all nodes.

-     Reinstall Matrix on all the nodes.

-     Log in to Matrix to restore the cluster and applications by using the backup files. For more information, see backup and restoration in Unified Platform Deployment Guide.

 


Troubleshooting docker service failure

No response after executing docker commands

Symptom

No response is returned for a long time after you execute the docker ps, docker images, docker inspect, or docker rmi command.

Solution

1.     Restart the docker service.

¡     For the root user, execute the following command to restart the docker service:

[root@master1 ~]# systemctl restart docker

¡     For a non-root user, execute the following command to restart the docker service:

[admin@master1 ~]$ sudo /bin/bash -c "systemctl restart docker"

2.     Verify that the docker service becomes normal.

¡     For the root user, execute the docker images command to view the docker service.

¡     For a non-root user, execute the sudo /bin/bash -c "docker images" command to view the docker service.

The docker service becomes normal if the command output displays the image information of the current node.

Figure 12 Output from the docker images command

 

The page displays that the Docker component is abnormal

Symptom

Power cycling damages the system files in the /var/lib/docker directory and causes the Docker service status to become abnormal.

Solution

To resolve this issue, reinstall the ISO image and rebuild the node.

Figure 13 Rebuilding the node

 


Troubleshooting failures caused by power cycle of the server or network disconnection

Files in the operating system were lost after power cycle of the server

Symptom

The following issues occurred when the server where Matrix is deployed was power cycled during Matrix running process or cluster/application deployment process (for example, cluster deployment, upgrade, restoration, rebuilding, or application deployment or upgrade):

·     Contents in the chronyd.service, docker.service, and containerd.service files in the /usr/lib/systemd/system directory were lost.

·     Contents in the chrony.conf, docker, etcd, hosts, and ssh configuration files in the /etc/ directory were lost. The deployenv.sh file in the /opt/matrix/k8s/ directory was lost.

·     The log files or contents in the log files in the /var/log directory were lost.

Solution

To resolve the issue:

·     The contents in the chronyd.service, docker.service, and containerd.service files were lost:

a.     Execute the ls /usr/lib/systemd/system/service-name.service command to identify whether the service file exists on each node or whether no information exists in the file.

b.     If the service file exists on some nodes and contains correct information, execute the scp command to copy the file to the nodes where the file does not exist or does not contain any information, as shown in the sketch after this list.

c.     If the file does not exist in any node, contact the support or re-install the operating system.

·     The files or contents in the log files in the /etc/ and /var/log directories were lost:

Contact the support or re-install the operating system.

·     The deployenv.sh file in the /opt/matrix/k8s/ directory was lost.

In cluster mode, copy the deployenv.sh file from another Master node that has that file. If no master nodes have that file, rebuild the node. In standalone mode, contact the support or reinstall Matrix.
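The following sketch illustrates the ls and scp checks described above, using docker.service as an example. The healthy node IP address is a placeholder, and the same pattern applies to the other service files and to the deployenv.sh file.

# On each node, check whether the file exists and is not empty
[root@node1 ~]# ls -l /usr/lib/systemd/system/docker.service
# Copy the file from a node that still has a correct copy (replace <healthy-node-ip>)
[root@node1 ~]# scp root@<healthy-node-ip>:/usr/lib/systemd/system/docker.service /usr/lib/systemd/system/
# Reload systemd after restoring service files
[root@node1 ~]# systemctl daemon-reload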

Files that Matrix depends on were lost after power cycle of the server

Symptom

The following issues occurred when the server where Matrix is deployed was power cycled during Matrix running process or cluster/application deployment process (for example, cluster deployment, upgrade, restoration, rebuilding, or application deployment or upgrade):

·     The service file or contents in the service file for the etcd or matrix service were lost:

·     The configuration files, for example, navigator_config.json or contents in the files in the /opt/matrix/ directory were lost.

·     The script files, for example, docker.sh, or contents in the files in the /opt/matrix/ directory were lost.

·     The Docker image files in the /var/lib/docker directory were damaged.

Symptom 1: Some Pods were in ImagePullBackOff state. The event log from the kubectl describe pod command is as follows:

error creating overlay mount to /var/lib/docker/overlay2/698028ac124c9d0ef831f7d2d9506acd01faddaae6ea06a0a169fb352e0eddf4/merged: too many levels of symbolic links

Symptom 2: The following message was displayed:

time="2021-05-10T18:05:50.518918884+08:00" level=error msg="Handler for GET /containers/2494c1172314e37bd8250be06a24e0636b7427f89b3b5a5398ecfad7c2fe171d/json returned error: readlink /var/lib/docker/overlay2/l: invalid argument"

·     The YAML file or contents in the YAML file in the /opt/matrix/ directory were lost.

Solution

To resolve the issue:

·     The service file or contents in the service file were lost, or the files or contents in the files in the /opt/matrix/ directory were lost:

a.     Execute the ls command to identify whether the service file exists on each node or whether no information exists in the file.

b.     If the service file exists on some nodes and contains correct information, execute the scp command to copy the file to the nodes where the file does not exist or does not contain any information.

c.     If the file does not exist in any node, contact the support or re-install Matrix.

·     The Docker image files in the /var/lib/docker directory were damaged:

a.     Upload the Matrix package to rebuild the node.

b.     Contact the support.

The node is in red or yellow or some Pods are in CreateContainerError state on the GUI of the Matrix platform after power cycle of the server

Symptom

The following issues occurred when the server where Matrix is deployed was power cycled during the Matrix running process or cluster/application deployment process (for example, cluster deployment, upgrade, restoration, rebuilding, or application deployment or upgrade):

·     The node is in red or yellow on the Deploy > Clusters page, indicating that Matrix-related Pods are abnormal.

·     Some Pods are in CreateContainerError state on the OBSERVE > Workload page, indicating that product-related Pods are abnormal.

To view all Pods in CreateContainerError state, access the CLI of any master node and execute the kubectl get pod -A -owide | grep CreateContainerError command.

[root@node1 home]# kubectl get pod -A -owide | grep CreateContainerError

NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE   IP NODE       NOMINATED NODE   READINESS GATES

kube-system   calico-kube-controllers-cd96b6c89-hfz7s   0/1     CreateContainerError   0  29d 10.99.212.164    node1   <none>           <none>

Solution

To resolve the issue, use either of the following methods:

·     Method 1:

a.     Access the node where the abnormal Pod resides and execute the docker ps | grep podname | grep -v POD | grep Up | awk '{print $1}' command to obtain the ID of the container in up state. The podname argument represents the name of the abnormal Pod.

[root@node1 home]# docker ps |grep calico-kube-controllers-cd96b6c89-hfz7s | grep -v POD|grep Up|awk '{print $1}'

c755b7812380

b.     Execute the docker stop containerid && docker rm containerid command, for example, docker stop c755b7812380 && docker rm c755b7812380, to delete the container in up state.

c.     Execute the kubectl get pod -A -owide | grep CreateContainerError command to check for Pods in CreateContainerError state. If such Pods still exist, log in to the GUI of the Matrix platform to rebuild the node.

·     Method 2: Log in to the GUI of the Matrix platform and rebuild the node where the abnormal Pod resides.

The node is in red or yellow or some Pods are in Error state on the GUI of the Matrix platform after power cycle of the server

Symptom

The following issues occurred when the server where Matrix is deployed was power cycled during Matrix running process or cluster/application deployment process (for example, cluster deployment, upgrade, restoration, rebuilding, or application deployment or upgrade):

·     The node is in red or yellow on the Deploy > Clusters page, indicating that Matrix-related Pods are abnormal.

·     Some Pods are in Error state on the OBSERVE > Workload page, indicating that product-related Pods are abnormal.

To view all Pods in Error state, access the CLI of any master node and execute the kubectl get pod -A -owide | grep -w Error command.

You can also access the CLI of the node where the Pod in Error state resides and execute the cat /var/log/matrix-diag/Matrix/kubelet/kubelet.log | grep "unexpected end of JSON input" command to view the kubelet logs about the node. If the following error message is displayed, Pod data is corrupted because of power cycle of the node and the Pod cannot start up.

Multus: failed to load netconf: unexpected end of JSON input

Solution

To resolve the issue, use either of the following methods:

·     Method 1 (applicable to scenarios where a small number of Pods are in Error state):

a.     Access the CLI of any master node and execute the kubectl get pod -A -owide | grep -w Error command to view the namespaces and names of Pods in Error state.

b.     Execute the kubectl delete pod -n namespace podName command to delete a Pod in Error state.

To delete multiple Pods in Error state, execute the command multiple times.

·     Method 2 (applicable to scenarios where a large number of Pods are in Error state):

Access the CLI of any master node and execute the kubectl get pod -A -owide | grep -w Error| awk '{print $1 " " $2}'| xargs kubectl delete pod –n command. This command deletes all Pods in Error state.

After you power off and restart the server where the node resides, the Prometheus data file becomes corrupted, causing abnormal Pod status

Symptom

Execute the kubectl get pod -n monitor -owide | grep prometheus command to check the Prometheus Pod name and status. You will find Pods in CrashLoopBackOff state. Use the kubectl logs -f  -n monitor prometheus-podname prometheus-server command to check the logs. The message that errorerr="opening storage failed: /data/xxx is displayed.

Solution

1.     Use the rm -rf /var/lib/ssdata/imonitor/prometheus_data/ command to delete the Prometheus data files on the node where the abnormal Pod resides.

2.     Copy the prometheus_data file from the node with a normal Pod to the node with an abnormal Pod. If all Pods are abnormal, delete the prometheus_data file on every node.

3.     Restart the abnormal Pod.

Power cycle of the node server results in unstable IPv6 NIC for the MACVLAN additional network

Symptom

A Matrix cluster has been deployed. After the server of the node hosting the application is power cycled, the NIC becomes unavailable. The specific symptoms are as follows:

Execute the kubectl exec -it -n kube-system harbor-master1-6mvlb /bin/bash command to enter the container. Then execute the ip a command to check all NIC IPs within the container. If the state of the IPv6 NIC (for example, eth2@if3) in the MACVLAN additional network is tentative dadfailed, this IPv6 NIC is unavailable.

[root@vdhcpsrc1-6658fb96f4-j4n4f /]# ip a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000

    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

    inet 127.0.0.1/8 scope host lo

       valid_lft forever preferred_lft forever

    inet6 ::1/128 scope host

       valid_lft forever preferred_lft forever

3: eth0@if914: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP

    link/ether 6e:e7:ed:2c:ed:5e brd ff:ff:ff:ff:ff:ff link-netnsid 0

    inet 177.177.204.216/32 scope global eth0

       valid_lft forever preferred_lft forever

    inet6 fd00:177:177:0:d416:1f2a:c3a4:ccac/128 scope global

       valid_lft forever preferred_lft forever

   inet6 fe80::6ce7:edff:fe2c:ed5e/64 scope link

       valid_lft forever preferred_lft forever

4: eth1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP

    link/ether d6:ae:4e:73:38:8d brd ff:ff:ff:ff:ff:ff link-netnsid 0

    inet 110.1.0.105/24 scope global eth1

       valid_lft forever preferred_lft forever

    inet6 fe80::d4ae:4eff:fe73:388d/64 scope link

       valid_lft forever preferred_lft forever

5: eth2@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP

    link/ether a2:23:c0:8f:ac:46 brd ff:ff:ff:ff:ff:ff link-netnsid 0

    inet6 130::105/64 scope global tentative dadfailed

       valid_lft forever preferred_lft forever

    inet6 fe80::a023:c0ff:fe8f:ac46/64 scope link

       valid_lft forever preferred_lft forever

Solution

After the server hosting the application is power cycled, the operating system might fail to reclaim the application processes that use the MACVLAN additional network and IPv6 subnets, resulting in residual application processes. In this case, restart the server hosting the container again for recovery.

Power cycle of the node server results in repeated restarts of the Pod that uses the additional network

Symptom

During normal Matrix operation or cluster/application deployment (such as cluster deployment, upgrade, restoration, and rebuild, and application deployment and upgrade), if the server hosting a node restarts, the following symptom occurs:

If you log in to the backend of any master node, and use the kubectl get pod -A -o wide | grep -v Running command to view the Pod status, you can see that the Pod that uses the additional network repeatedly restarts. Use the kubectl describe pod -n namespace podName command to view Pod events. The namespace is the namespace of the abnormal Pod, and podName is the name of the abnormal Pod. The error message is as follows:

Err adding pod to network"net-dtn-sim203": Multus: error in invoke Delegate add -"macvlan": failed to allocate for range 0: requested IP address172.30.128.140 is not available in range set3.2.27.1-3.2.27.254,172.30.128.1-172.30.128.254

Solution

The content loss of the additional network configuration file is due to a power cycle. To resolve this issue, clean the configuration file and follow these steps to perform restoration by using the root user account to log into the backend of the node where the abnormal Pod is located:

1.     Use the cd /var/lib/cni/networks/${macvlan-name}/ command to enter the configuration directory, where macvlan-name is the network name in the error message, such as net-dtn-sim203 in the example.

2.     Use the rm -rf ${IP} command to delete the damaged configuration file, where IP is the IP address in the error message, such as 172.30.128.140 in this example.

 


Troubleshooting deployment failure

Cluster deployment failed and a K8SINSTALL-ERROR occurred

Symptom

Cluster deployment failed. The node logs page accessed by clicking the setting icon  for the node displays K8SINSTALL-ERROR.

Solution

This issue might occur if the node has multiple NICs in up state.

If a NIC is not configured with an IP address:

By default, the arp_ignore parameter is set to 0 in an operating system and the system responds to all ARP requests for the local IP address, including requests for the loopback adapter's address. The matrix node might encapsulate the MAC address of the NIC not configured with an IP address in ARP responses, causing connection errors between cluster nodes.

To resolve the issue:

1.     Use the ifconfig command to view the NIC order when deploying, upgrading, or rebuilding a cluster in a scenario where multiple NICs exist.

2.     Make sure all physical NICs before the NIC whose IP is used as the node IP in the Matrix cluster have IPs assigned or the ONBOOT value set to no. For example, if the node uses the IP of NIC ens191 as the node IP in the Matrix cluster, and ens190 is before ens191 in order, make sure ens190 has an IP assigned.

3.     Make sure no abnormal NIC configuration file exists in the cluster. For example, the ONBOOT value is yes when the NIC is not connected or not configured with an IP.

4.     If the cluster uses a bonding interface as the node IP in the Matrix cluster, make sure all physical NICs not the members of the bonding interface have IPs assigned or the ONBOOT value set to no.

5.     Reboot the network service.

Unified Platform deployment failed because the kubectl exec command failed

Symptom

The system failed to deploy Unified Platform. The log shows that gfs volume creation failed because one node failed to execute the kubectl exec command (the command output reports an error). If you execute the kubectl exec -it pod bash command on that node, all Pods on the node are inaccessible.

Solution

To resolve the issue:

1.     Log in to the CLI of the node that failed to execute the kubectl exec command.

2.     Execute the systemctl restart kubelet.service command to restart the kubelet service on the node.

3.     If the issue persists, contact the support.

Cluster deployment failed in an IPv6 environment when a vNIC was added to a node or an IP address was added to a vNIC

Symptom

The rebuild or upgrade operation fails for a cluster if you add a vNIC to a node or add a new IP address to the vNIC on that node, because the IP address of other nodes in this cluster and this new IP address are not on the same subnet. The system prompts ping failure if you execute the ping6 pod_ip command at the CLI of that node. The pod_ip argument represents the IP address of the container. You can use the kubectl get pod -n kube-system -o wide command to obtain that IP address.

Solution

To resolve the issue:

1.     Change the new IP address to an IP address that is on the same subnet as other nodes.

2.     Configure a routing policy on other nodes so that these nodes can communicate with each other.

3.     If the issue persists, contact the support

 

 


Troubleshooting Unified Platform access failure

Slow request processing caused by ETCD I/O latency

Symptom

Failed to access Unified Platform.

When you review the ETCD log (/var/log/matrix-diag/Matrix/etcd/etcd.log), the following prompt is displayed:

context deadline exceeded, waiting for ReadIndex response took too long, retrying,

When you review the apiserver log and the following prompt is displayed:

stopped listening on [::]:6443

Solution

This symptom might occur when the API server failed to obtain data from ETCD multiple times because of ETCD latency, and then stopped listening to port 6443. In this case, the service components failed to use the K8s API through port 6443.

To resolve the issue:

1.     Verify that the disk I/O performance can meet the requirements. If the average I/O performance value is equal to or greater than 10000, the disk I/O performance meets the requirements. If the average I/O performance value is smaller than 10000, an error has occurred on the disk I/O performance, and you must improve the disk I/O performance.

To test the disk I/O performance:

¡     For the root user, execute the bash /opt/matrix/tools/env_check.sh -p -d /var/lib/etcd command.

¡     For a non-root user, execute the sudo bash /opt/matrix/tools/env_check.sh -p -d /var/lib/etcd command.

2.     Execute the kubectl get pod -n service-software | grep stolon-keeper command to obtain the names of all stolon-keeper Pods.

3.     Execute the kubectl delete pod -n service-software pod_name command to restart the stolon-keeper Pods one by one.

4.     Access Unified Platform after all stolon-keeper Pods recover to the running state.

Unified Platform login failure due to password loss

Symptom

If the user password is lost, the user cannot log in to Unified Platform.

Solution

The cause of this issue is loss of the user password.

Follow these steps to reset the password:

1.     Log in to the /opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset/ directory at the backend of the primary node.

[root@node1 ~]# cd /opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset/

2.     Run the reset_admin_password.sh script to reset the password.

[root@node1 reset]# ./reset_admin_password.sh

+++ readlink -f ./reset_admin_password.sh

++ dirname /opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset/reset_admin_password.sh

+ curDir=/opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset

+ echo /opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset

/opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset

+ cd /opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset

++ kubectl get pod -nservice-software

++ grep stolon-proxy

++ grep Running

++ awk 'NR==1{print $1}'

+ proxy=stolon-proxy-2tp4v

+ kubectl cp ./reset_admin_password.sql -nservice-software stolon-proxy-2tp4v:/tmp

++ base64 -d

++ kubectl get secret -nservice-software stolon '-ojsonpath={.data.password}'

+ pg_admin_password=Or10TL+hRs.=N@0l54

+ kubectl exec -it -nservice-software stolon-proxy-2tp4v -- bash -c 'export PGPASSWORD=Or10TL+hRs.=N@0l54;psql -U kong -d central_db -h 127.0.0.1 -p 5432 -f /tmp/reset_admin_password.sql'

UPDATE 1

+ exit 0

3.     Reset the password to Pwd@12345.

Login anomalies caused by Ext4 file system damage

Symptom

You cannot log in to Matrix or Unified Platform. When you execute any command in the backend, the system returns error message "Read-only file system".

Solution

After a node restarts, if it gets stuck during the system boot process and displays messages like "You are in emergency mode" or "Entering emergency mode", this indicates that the node has entered emergency mode of the operating system. To resolve the issue in this mode:

1.     Execute the dmesg | less or cat /var/log/messages command to view the system logs and identify the partition of the damaged file system. Use partition /dev/sdX as an example of a faulty partition.

2.     Execute the mount|grep /dev/sdX command to view the partition type. If the partition type is ext4, proceed with the following steps.

3.     Execute the umount /dev/sdX command to unmount the file system of that partition.

4.     Execute the e2fsck -y /dev/sdX command to repair the partition. If the file system is severely damaged, data loss might occur after the repair.

 


Troubleshooting GlusterFS anomalies

GlusterFS deployment fails because the disk or disk partition it uses has residual data

Symptom

Use a terminal tool, such as MobaXterm, to connect to the backend at the northbound IP and view the Matrix logs in the /var/log/matrix-diag/Matrix/Matrix/matrix.log file. The Matrix platform displays a message that the disk or disk partition GlusterFS uses has residual data.

Device include vg , nodename:node1, device:/dev/vda3

Cause analysis

Heketi for GlusterFS requires an empty disk or disk partition for deployment, but the disk or disk partition for GlusterFS deployment contained data. You must clean up the disk manually.

Solution

To resolve the issue:

1.     Access the directory where the disk cleanup script resides.

[root@m2 ~]# cd /

opt/matrix/app/install/metadata/gluster/gluster/scripts/tools/

2.     Execute the bash clearDisk.sh disks command to clean up disks, where disks represents the disks to clean up. You must enclose the list of disks in double quotation marks. If you are to clean up multiple disks or disk partitions, use spaces to separate them, for example, bash clearDisk.sh "/dev/vdb /dev/vdc".

[root@m2 ~]# bash clearDisk.sh "/dev/vdb /dev/vdc"

[clear_disk] CAUTION: Please confirm whether to erase the disk /dev/vdb /dev/vdc

Continue anyway? (Y/N) : y

[clear_disk] CAUTION: Please confirm whether to clear glusterfs config file

Continue anyway? (Y/N) : y

[clear_disk] Disk erase complete.

CAUTION

CAUTION:

Identify the disk or disk partition before cleaning up it.

 

3.     Repeat the preceding procedure on all the other master modes and then redeploy GlusterFS.

4.     If the issue persists, contact the support.

Failed to install components that use GlusterFS storage because storage volumes cannot be deleted

Symptom

The installation of a component that uses GlusterFS storage failed. The Matrix logs show that the system failed to delete storage volumes when the installation script uses the volume.sh script. When you execute commands to delete storage volumes on the server where the active master node resides, an error still occurs and the storage volumes fail to be deleted.

Cause analysis

For installation of a component that uses GlusterFS storage, the installation script will use GlusterFS heketi commands to delete and re-create the GlusterFS storage volume. Due to open source issues, during storage volume deletion, an error message appears indicating that the storage volume is mounted to the operating system. However, the mounting information of the operating system does not display any mounting information about the storage volume that causes the deletion failure of the storage volume.

Solution

To resolve the issue:

1.     Access the CLI of each master node and restart the server where the nodes reside. As a best practice, restart the standby master nodes and then the active master node.

2.     Deploy the components that use GlusterFS storage after the cluster recovers to normal.

Failed to upgrade components that use GlusterFS storage because glusterd exits

Symptom

1.     When the upgrade of a component that uses GlusterFS storage failed, access the CLI of the node and review the Matrix logs. The logs show that the upgrade of the component that uses GlusterFS storage has failed multiple times.

2.     Access the CLI of a master node, and execute the kubectl get po -A -owide | grep   glusterfs command to obtain the names of all the GlusterFS Pods in running state.

[root@matrix ~]# kubectl get po -A -owide | grep   glusterfs

glusterfs-example   glusterfs-l6fcr                1/1     Running   0   3d23h   10.99.212.200 matrix   <none>           <none>

glusterfs-example   heketi-848f8f7dd6-nc2kq                       1/1     Running   0 3d23h   177.177.95.77    matrix   <none>           <none>

glusterfs-example   monitor-84964d7cd7-2wjrr                      1/1     Running   0 3d23h   177.177.95.78    matrix   <none>           <none>

3.     Execute the kubectl exec -it -n glusterfs-example   glusterfs-l6fcr /bin/bash command to access GlusterFS Pods.

4.     Execute the ps -aux | grep /usr/sbin/glusterd | grep -v grep command, but the glusterd process is not displayed in the command output.

Solution

This symptom might occur if the glusterd process in GlusterFS Pods abnormally exits during upgrade of components that use GlusterFS storage, which causes the failure of executing storage-related scripts.

To resolve the issue:

1.     Execute the kubectl get po -A -owide | grep   glusterfs command to obtain the names of all GlusterFS Pods in running state, for example, glusterfs-l6fcr.

2.     Execute the kubectl exec -it -n glusterfs-example   glusterfs-l6fcr /bin/bash command to access containers in the GlusterFS Pod.

3.     Execute the systemctl restart glusterd command to restart the glusterd process.

4.     Execute the ps -aux | grep /usr/sbin/glusterd | grep -v grep command to identify whether the glusterd process has been restarted.

5.     After the glusterd process restarts, upgrade the components that use GlusterFS storage again.

GlusterFS service anomaly occurred after Matrix rebuilding by using the ISO image

Symptom

After a node is rebuilt by reinstallation of the ISO image, no GFS-related data can be found in the lsblk command output on the rebuilt node, and operations such as volume creation fail.

Analysis

The possible causes for this issue include:

·     GlusterFS Heketi requires a blank disk or a blank disk partition, but other data exists in the disk or disk partition used by GlusterFS. As a result, GlusterFS data cannot be synchronized to the rebuilt node.

·     The partition number of GFS is inconsistent with that on the node before rebuilding.

·     The glusterd service in the Glusterfs pod of the rebuilt node is still in an abnormal state when GlusterFS initiates data synchronization.

Solution

1.     If other data exists in the disk or disk partition used by GlusterFS, perform the following steps:

a.     Enter the directory where the disk cleaning script is stored.

[root@m2 ~]# cd /opt/matrix/app/install/metadata/gluster/gluster/scripts/tools/

b.     Execute the bash clearDisk.sh disks command to execute the script, where disks represent disks that need to be cleared. You must use double quotation marks to enclose the disks or disk partitions and separate them by spaces, for example, bash clearDisk.sh "/dev/vdb /dev/vdc".

[root@m2 ~]# bash clearDisk.sh "/dev/vdb /dev/vdc"

[clear_disk] CAUTION: Please confirm whether to erase the disk /dev/vdb /dev/vdc

Continue anyway? (Y/N) : y

[clear_disk] CAUTION: Please confirm whether to clear glusterfs config file

Continue anyway? (Y/N) : y

[clear_disk] Disk erase complete.

CAUTION

CAUTION:

Clearing a disk is risky. Identify the disk or partition you are to delete carefully.

 

2.     Verify that partition information for GFS after rebuilding is consistent with that before rebuilding.

[root@c1 ~]# cat /opt/matrix/app/install/metadata/gluster/gluster/heketi/config/cluster.json

{

  "node" : [ {

    "nodename" : "c1",

    "device" : [ "/dev/vdc" ]

  }, {

    "nodename" : "c2",

    "device" : [ "/dev/vdc" ]

  }, {

    "nodename" : "c3",

    "device" : [ "/dev/vdc" ]

  } ]

3.     Reboot the glusterfs pod of the rebuilt node, for example, delete the glusterfs pod with hostname c3.

[root@c3 ~]# kubectl get pod -A -owide |grep glusterfs |grep c3

glusterfs-example   gfs-exporter-php62                         1/1     Running   0          46m    10.99.212.72     c3     <none>           <none>

glusterfs-example   glusterfs-fh2cc                            1/1     Running   0          46m    10.99.212.72     c3     <none>           <none>

glusterfs-example   heketi-75d6c7db69-vhzh2                    1/1     Running   0          26m    177.177.240.5    c3     <none>           <none>

glusterfs-example   monitor-5f9bd8ccb4-54mrn                   1/1     Running   0          26m    177.177.240.4    c3     <none>           <none>

[root@c3 ~]# kubectl delete pod -n glusterfs-example   glusterfs-fh2cc

pod "glusterfs-fh2cc" deleted

For more recovery procedure, see the remote replication methods for GlusterFS storage volume data files at the GlusterFS official website https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-georep/.

Unified Platform upgrade/uninstallation or remote disaster recovery failure caused by deadlocks in GlusterFS

Symptom

Unified Platform upgrade/uninstallation or remote disaster recovery fails.

Solution

Failure cause: A deadlock is a common issue in distributed systems, and might occur in GlusterFS. A deadlock occurs when a set of processes or threads are unable to proceed because each is waiting for a resource that another holds. In practical applications, administrators should focus on designing a robust system architecture and implementing reasonable operation procedures to minimize deadlocks. They should also use the system's deadlock prevention and resolution mechanisms to ensure system reliability and stability.

To resolve the issue:

1.     Enter each GlusterFS pod in sequence and execute the following command to check for output. If any information is output, it indicates a deadlock exists.

cat /var/log/glusterfs/glusterd.log | grep "0-management: Failed to release mgmt_v3 locks"

cat /var/log/glusterfs/glusterd.log | grep "0-management: Failed to release lock for vol"

2.     If you acknowledge a deadlock, execute the systemctl restart glusterd.service command in the pod to restart the glusterd service and release the lock.

After you restart a node or the network changes, the mount directories of some pods that use the GlusterFS storage might have question marks (???) in the file content and the file content cannot be read or written

Symptom

??? files are displayed when the ls –l command is executed in the GFS mounted directory on the host or service container, and the files are not accessible.

Cause analysis

The remaining disk space on the nodes where the three replicas of the GlusterFS storage volume reside is inconsistent, resulting in data inconsistency between the three replicas after data writing. As a consequence, the GlusterFS storage volume data file has a brain split.

Solution

To resolve the issue:

1.     Execute the kubectl get po -A |grep glusterfs command to view the names and namespaces of the GlusterFS Pods.

2.     Execute the kubectl exec command to access the GlusterFS container and then execute the gluster volume heal VolumeName info command to view whether words Is in split-brainis are displayed in the command output and record the file paths. VolumeName is the name of the faulty storage volume. To obtain the storage volume name, execute the kubectl exec –it {gfs pod namespace+gfs pod name} - gluster volume list | grep {service data volume name} command.

3.     Use one of the following methods to resolve the issue:

¡     To resolve the issue based on the file size, execute the gluster volume heal VOLNAME split-brain bigger-file filepath command. VOLNAME is the name of the storage volume. filepath is the full path of the largest file.

¡     To resolve the issue based on the update time of the file, execute the gluster volume heal VOLNAME split-brain latest-mtime filepath command. VOLNAME is the name of the storage volume. filepath is the full path of the most recently generated file.

4.     Use the procedures provided at https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/ to resolve the issue.

5.     If the issue persists, contact the support.

Unified Platform upgrade/uninstallation or remote disaster recovery failure caused by deadlock in GlusterFS

Symptom

Unified Platform upgrade/uninstallation or remote disaster recovery fails.

Solution

Failure cause: A deadlock is a common issue in distributed systems, and might occur in GlusterFS. A deadlock occurs when a set of processes or threads are unable to proceed because each is waiting for a resource that another holds. In practical applications, administrators should focus on designing a robust system architecture and implementing reasonable operation procedures to minimize deadlocks. They should also use the system's deadlock prevention and resolution mechanisms to ensure system reliability and stability.

To resolve the issue:

1.     Enter each GlusterFS pod in sequence and execute the following command to check for output. If any information is output, it indicates a deadlock exists.

cat /var/log/glusterfs/glusterd.log | grep "0-management: Failed to release mgmt_v3 locks"

cat /var/log/glusterfs/glusterd.log | grep "0-management: Failed to release lock for vol"

2.     If you acknowledge a deadlock, execute the systemctl restart glusterd.service command in the pod to restart the glusterd service and release the lock.

In the GlusterFS disaster recovery environment, the primary and backup clusters experience data synchronization issues due to the absence of scheduled tasks

Symptom

After you create a disaster recovery system, the data synchronization status of glusterfs in the remote disaster recovery menu might be displayed as Synchronizing. Log in to the backend of the primary cluster through the northbound IP and execute the cat /etc/crontab command. The command output shows that the scheduled task /opt/matrix/app/install/metadata/gluster/gluster/glusterfs/scripts/rdr/gfs_task.sh is missing.

Solution

The reason for the data synchronization anomaly is that some nodes in the cluster do not have the scheduled task gfs_task.sh for synchronizing data. The specific cause of the missing scheduled task is unclear, but it might be related to the server environment. To resolve this issue:

1.     Log in to each master node of the primary and  backup clusters in sequence and execute the cat /etc/crontab command to search for scheduled tasks. If the output does not contain the /opt/matrix/app/install/metadata/gluster/gluster/glusterfs/scripts/rdr/gfs_task.sh line, proceed with step 2. If the output contains the line, skip this node without taking any action.

2.     Execute the sed -i '$a\*/3 * * * * root /opt/matrix/app/install/metadata/gluster/gluster/glusterfs/scripts/rdr/gfs_task.sh >>/var/log/gfs_task_log 2>&1' /etc/crontab command to add a scheduled task for GlusterFS data synchronization to the node.

 


Troubleshooting IP editing failure

Failed to edit the VIPs

Symptom

Failed to edit VIP settings after in the Advance section on the Deploy > Clusters > Cluster Parameters > Modify Cluster Parameters page. The Matrix logs show the following error messages:

2022-02-16T10:33:52,207 | INFO  | DeployResource-11-thread-1 | K8sClientHelper.getConfigMapByName:2120 | [K8sClientHelper] get configmap by name param: namespace kube-system, configmapName kube-proxy

2022-02-16T10:33:52,227 | ERROR | DeployResource-11-thread-1 | DefaultUncaughtExceptionHandler.uncaughtException:18 | uncaught exception in Thread[DeployResource-11-thread-1,5,main], stack: [java.lang.Thread.getStackTrace(Thread.java:1559), com.h3c.matrix.util.DefaultUncaughtExceptionHandler.uncaughtException(DefaultUncaughtExceptionHandler.java:18), java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1057), java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1052), java.lang.Thread.dispatchUncaughtException(Thread.java:1959)]

java.util.ServiceConfigurationError: io.fabric8.kubernetes.api.KubernetesResourceMappingProvider: Provider io.fabric8.kubernetes.internal.InternalResourceMappingProvider not found

Solution

This symptom might occur if Matrix failed to obtain ConfigMaps due to Fabric8 issues.

To resolve the issue, execute the systemctl restart matrix to restart the current node, and then edit VIP settings again.

 


Troubleshooting image corruption

Image corruption

Symptom

An image corruption error occurs if one of the following symptoms occurs:

·     Symptom 1:

When a Pod is in ImagePullBackOff state and you execute the kubectl describe pod -n namespace podName command to obtain the event log, the command output displays the following:

too many levels of symbolic links

The namespace argument represents the name space of the Pod, and the podName argument represents the name of the Pod.

·     Symptom 2:

When a Pod is in ImageInspectError state and you execute the kubectl describe pod -n namespace podName command to obtain the event log, the command output displays the following:

readlink /var/lib/docker/overlay2/l: invalid argument"

The namespace argument represents the name space of the Pod, and the podName argument represents the name of the Pod.

·     Symptom 3:

Frequently restart of the kubelet service on the node triggers abnormal K8S node status (Not Ready state). The node is marked as red on the page. The node details shows that the kubelet and coreDNS check items are abnormal. From the kubelet log of the faulty node (/var/log/matrix-diag/Matrix/kubelet/kubelet.log), an error log message is frequently output that Image garbage collection failed once. Stats initialization may not have completed yet" err="failed to get imageFs info: unable to find data in memory cache.

Solution

To resolve the issue:

1.     Execute the following commands in sequence to delete all containers and images on the node where the faulty Pod resides:

[root@master1 ~]# systemctl restart docker

[root@master1 ~]# docker system prune

[root@master1 ~]# docker rm -f $(docker ps -aq)

[root@master1 ~]# docker rmi -f $(docker images -q)

2.     If the issue persists, log in to the Matrix platform and rebuild the node where the faulty node resides.

Image layer verification failed

Symptom

The installation or upgrade of an application component failed. Use the kubectl describe pod -n namespace podName command to view the event log, where the namespace argument represents the namespace of the pod and the podName argument represents the name of the pod. In this example,  the namespace is development and the pod name is my-app-pod.

[root@master1 ~]# kubectl describe pod -n development my-app-pod

Filed to pull image matrix-registry.h3c.com:8088/matrix:metrics-server:v0.6.4: rpc error: code =Unknow desc = filesystem layer verification failed for digest sha256:xxxx 

Solution

Access the /opt/matrix/k8s/disaster-recovery/ directory. If the recover-image.sh script exists, execute the bash recover-image.sh imageName:imageTag command. The imageName and imageTagarguments represent the image name and version number in the log query result. In this example, the following command uses matrix and metrics-server:v0.6.4 as the image name and image version number. After the command is successfully executed, deploy or upgrade the component again. If the script does not exist, contact Technical Support for help.

[root@master1 ~]# cd /opt/matrix/k8s/disaster-recovery/

[root@master1 disaster-recovery]# bash recover-image.sh matrix:metrics-server:v0.6.4

 


Troubleshooting PXC failures

Failed to start the PXC database

Symptom

One of the following symptoms might appear:

·     After power cycle of the server, the application services failed to start up. The runtime log prompts that the database connection is abnormal.

·     Failed to log in to the CLI of a master node.

A successful database login is as shown in the following figure:

 

Solution

1.     Execute the following command to delete Pods in the database cluster:

kubectl get pod -n service-software -o wide | grep pxc-node | awk '{print $1}' | xargs kubectl -n service-software delete pod

 

2.     Execute the kubectl logs -f command to obtain the database container startup log.

¡     In cluster mode, if the log shows all pxc node start up at the end, the database cluster is repaired successfully.

¡     In standalone mode, if the log shows mysql state is Synced, the database is repaired successfully.

 

 

Disk file corruption of PXC database

Symptom

In cluster mode, the PXC database failed to operate correctly after the server power cycled. In addition, the service Pods associated with the database failed to start up even when the PXC database restarted.

Solution

1.     Execute the kubectl logs -f to obtain the startup logs about each container for the PXC database and identify which containers failed to prompt Starting MySQL (Percona XtraDB Cluster) database server.

2.     Execute the following command to stop starting the corrupted pxc-node containers:

kubectl delete -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node{1/2/3}.yaml

3.     Clear the permanent directories of the corrupted containers. In this example, Containers pxc-node1, pxc-node2, and pxc-node3 are associated with nodes master1, master2, and master3 of Matrix, respectively. The permanent directories for containers pxc-node1, pxc-node2, and pxc-node3 are /var/lib/ssdata/pxc/pxc/{1}, /var/lib/ssdata/pxc/pxc/{2}, and /var/lib/ssdata/pxc/pxc/{3}, respectively. Execute the rm -rf /var/lib/ssdata/pxc/pxc/{1/2/3}/ command to delete the corresponding permanent directory.

As a best practice, move the files in a permanent directory to another directory and delete them after the repair succeeds.

4.     Execute the following command to restart the pxc-node containers that have stopped:

kubectl apply -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node{1/2/3}.yaml

5.     Execute the kubectl logs -f command to obtain the database container startup log.

In cluster mode, if the log shows all pxc node start up at the end, the database cluster is repaired successfully.

 

 

PXC database in abnormal state

Symptom

·     Symptom 1:

The services associated with the PXC database can connect to the database but cannot use the database. For example, a service might receive a response like WSREP has not yet prepared node for application use.

·     Symptom 2:

The services associated with the PXC database can connect to the database but cannot use the database. The log might receive a response indicating that it timed out waiting for a lock to be unlocked and you can try to send the request again.

·     Symptom 3:

The services associated with the PXC database can connect to the database but cannot use the database. No response is returned from the database.

Cause analysis

·     Cause analysis for symptom 1:

This symptom might occur if a brain split occurs on the database cluster. Typically, the database cluster can recover to normal automatically. You can access the CLI of a master node to identify the state of the database cluster. If the database cluster has recovered to normal, the query result for the wsrep_local_state_comment, wsrep_ready, and wsrep_incoming_addresses fields are as shown in the following figure:

 

If the query result is different, the cluster has not recovered from the brain split. For example:

¡     The value for the wsrep_local_state_comment field might be Initialized or Joining: receiving State Transfer.

¡     The value for the wsrep_ready field might be OFF, which indicates that the current container is not available.

¡     The wsrep_incoming_addresses field does not display IP addresses of all pxc-node containers, which indicates that not all the containers are in the same database cluster.

·     Cause analysis for symptom 2:

This symptom might occur if the database has dead locks. The dead locks might be metadata locks or exclusive locks.

·     Cause analysis for symptom 3:

This symptom might occur if data synchronization across the database cluster failed.

Solution

1.     Execute the following command to delete Pods in the database cluster:

kubectl get pod -n service-software -o wide | grep pxc-node | awk '{print $1}' | xargs kubectl -n service-software delete pod

 

2.     Execute the kubectl logs -f command to obtain the database container startup log.

¡     In cluster mode, if the log shows all pxc node start up at the end, the database cluster is repaired successfully.

¡     In standalone mode, if the log shows mysql state is Synced, the database is repaired successfully.

 

 

Loss of all contents of PXC database startup file grastate.dat

Symptom

In standalone mode, the PXC database fails to start up after the server shuts down unexpectedly and then restarts. As the result, pods dependent on the PXC database fail to start. The issue persists after the PXC database pod is restarted.

·     View the grastate.dat file of the PXC database in the CLI of the node. It is found that the file is empty and the file content is lost.

 

·     The database cannot be connected.

 

·     The pods that use the database service are abnormal.

 

·     The following error messages are displayed for the PXC Pod:

 

Solution

1.     Execute the following command to stop container pxc-node1 that failed to start up.

kubectl delete -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node1.yaml

2.     Execute the vim grastate.dat command, add the following contents to the file, and then save the file.

# GALERA saved state

version: 2.1

uuid:    2013b697-a063-11ed-b00e-d340082886cf

seqno:   -1

safe_to_bootstrap: 1

Figure 14 Viewing the grastate.dat file content

 

3.     Execute the following command to restart the stopped container pxc-node1.

kubectl apply -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node1.yaml

4.     Execute the kubectl logs -f command to view the startup logs of the database container. The database has been repaired if a message "mysql state is Synced message" is printed.

5.     Verify correct connectivity of the database.

 

 

Power outage resulting in loss of all disk file content required to start the PXC database

Symptom

A power outage caused the loss of all disk file content required to start the PXC database (the grastate.dat file was cleared). The abnormal Pod cannot connect to the PXC database.

Solution

To repair the grastate.dat file:

1.     Connect to the backend of the node where the abnormal Pod is located (switch to the root user with sudo -i if you are a non-root user).

2.     Enter the PXC data directory by executing the cd /var/lib/ssdata/pxc/pxc/* command.

3.     Execute the cat grastate.dat command to check whether the file is empty. Typically, the file is as shown in the following figure.

Figure 15 Viewing the file

 

4.     If the file content is empty, use the following command to complete it (the red part is the content of the file):

[root@mmp-001 1]# cat > grastate.dat <<- EOF

# GALERA saved state

version: 2.1

uuid: 938a614c-8c38-11ee-a589-8a75c3c33cea

seqno: -1

safe_to_bootstrap: 0

EOF

In a standalone environment, you only need to make sure the file format is correct without editing values such as UUID. They are automatically generated upon startup. In a cluster environment, you can obtain the file content from another node that is operating correctly.

5.     After repairing the file, execute the following command to bulk delete the database cluster Pods and restart pxc-node.

kubectl get pod -n service-software -o wide | grep pxc-node | awk '{print $1}' | xargs kubectl -n service-software delete pod

Figure 16 Command view

 


Troubleshooting server restart issues

Power outage resulting in ZooKeeper data sync failure between three nodes

Symptom

When a power outage occurs on the three nodes, the ZooKeeper data cannot be synced between them. After power is restored and services are restarted, you cannot guarantee that ZooKeeper on all three nodes will start simultaneously, resulting in sync failure of ZooKeeper persistent data.

Solution

To resolve the issue, rebuild ZooKeeper by deleting and restarting the three ZooKeeper Pods, as shown in the following figure.

Figure 17 Rebuilding ZooKeeper

 

Server power outage resulting in damage of the XFS file and Vertica database

Symptom

After a server power outage, SSH login to the backend fails. Restarting the sshd service triggers an emergency mode, indicating that the XFS file system is damaged due to the exceptional power outage. Verticaluanch and verticaluanch-10s frequently restart, indicating an issue with the Vertica database.

Solution

To repair the XFS file system:

1.     Use the lsblk command to locate the mount path and the umount command to unmount it, ensuring the partition is in an unmounted state.

Figure 18 Viewing the mount path

 

2.     Execute the xfs_repair -n command to check whether the file system is damaged.

3.     Execute the xfs_repair  -L /dev/sda1 command to repair the file system.

4.     Execute the xfs_ncheck /dev/sda1 to check whether the file system has been successfully repaired.

CAUTION

CAUTION:

Executing the xfs_repair command with the -L option to repair the XFS file system will clear the log and might result in the loss of user data and files. Make sure you understand the impact before executing the command in this way.

 

5.     Verify the result

a.     After you repair the disk with damaged files and restart it, SSH login to the node recovers to normal status, and the Pods resume normal operation.

b.     The database exception is associated with the issue described in corrupted ETCD DB files. The issue prevents associated tasks from being started. The issue is cleared after you rebuild the database of the node.

Kafka startup failure due to ZooKeeper data sync issue upon restart

Symptom

A cluster power outage results in a Kafka anomaly, preventing Campus from receiving replies from Kafka during interactions with devices through WebSocket. As a result, all interconnect links between leaf and spine devices turn red in the topology.

Solution

To resolve this issue:

1.     Clear the data in the /var/lib/ssdata/zookeeper and /var/lib/ssdata/kafka directories on all nodes.

2.     Restart the Pods associated with Kafka and ZooKeeper.

Execute the delete command on Kafka and ZooKeeper Pods in sequence.

Figure 19 Restarting Pods

 

Forcible power shutdown resulting in damage to the operating system's XFS partition upon server restart and system anomalies

Symptom

A forcible on-site power shutdown results in damage to the operating system's XFS partition upon server restart. This results in system anomalies, preventing access to the operating system or triggering the emergency mode.

Solution

Enter emergency mode or single-user view to forcibly repair the damaged partition. After that, restart the server. Then you can access the operating system correctly.

To repair the XFS file system:

1.     Use the lsblk command to locate the mount path and the umount command to unmount it, ensuring that the partition is in an umounted state.

Figure 20 Viewing the mount path

 

2.     Execute the xfs_repair -n command to identify whether the file system is damaged.

3.     Repair the file system by executing associated commands such as xfs_repair -L /dev/sda1.

4.     Execute the xfs_ncheck /dev/sda1 to check whether the file system has been successfully repaired.

CAUTION

CAUTION:

Executing the xfs_repair command with the -L option to repair the XFS file system will clear the log and might result in the loss of user data and files. Make sure you understand the impact before executing the command in this way.

 


Troubleshooting Unified Platform upgrade failures

After you upgrade Unified Platform, the k-eureka pod and others are abnormal and in Pending state.

Symptom

After you upgrade Unified Platform, execute the kubectl get pod -A | grep eure command to check the status of pods. In the command output, some pods, for example, the k-eureka pod, are in Pending state.

Figure 21 k-eureka-2 pod in Pending state

 

Execute the kubectl describe pod command. The system prompts an error that the node hosting the k-eureka pod does not meet the pod affinity/anti-affinity rules.

Figure 22 Pod affinity/anti-affinity rules are not met

 

When you check the affinity/anti-affinity rules configured for the node where the pod is located, you find that the node meets the requirements.

The issue occurs due to the residual information of the deleted k-eureka pod in the kube-scheduler cache. When a pod with the same name is started again, scheduling anomalies might occur.

Solution

Restart the kube-scheduler pod on each master node.

Perform the following tasks in the backend of one of the master nodes:

1.     Execute the kubectl get pod -A -owide | grep kube-scheduler command to obtain name of the kube-scheduler pod.

Figure 23 Viewing the kube-scheduler pod name

 

2.     Execute the kubectl delete pod -n kube-system pod-name command , where pod-name parameter is the pod name obtained in step 1. In a cluster environment, you can delete multiple kube-scheduler pods simultaneously. Replace the pod-name parameter with the names of three kube-scheduler pods, separated by spaces.

Figure 24 Deleting multiple kube-scheduler pods in a cluster environment

 

3.     After the kube-scheduler pod restarts, the abnormal pods will restore to normal state after a while. If the issue persists, contact Technical Support for help.

 


Troubleshooting RDRS

If you delete the RDRS on the primary site after the primary and backup sites are disconnected, residual configurations exist on the backup site. 

Symptom

In an RDRS scenario, if the backup site goes down or the primary site experiences network issues, deleting the RDRS on the primary site will leave residual RDRS configurations on the backup site. As a result, the components in the backup site remain in RDRS mode.

Solution

Cause

Because network issues occur between the primary site and the backup site or the backup site is down, the backup site fails to receive and execute requests from the primary site, resulting in residual configuration.

Solution

Execute the script to delete the backup site configuration to resolve this issue. More specifically:

Execute the following command on any node in the backup site. If the script execution log prompts success, it means the repair was successful.

sh  /opt/matrix/app/install/metadata/UCENTER/general/rdr/scripts/clearMemberRdrConfig.sh

 

During the process of primary/backup switchover, a network issue between the primary and backup sites caused the new backup site to become inaccessible.

Symptom

In an RDRS scenario, during the process of primary/backup switchover, network anomalies between the primary and backup sites cause the new backup (originally primary) site to become inaccessible. As a result, the page is displayed abnormally.

Solution

Cause

Due to network issues during the primary/backup switchover, the new backup site cannot establish a connection to the new primary site while switching to the backup role. As a result, a database anomaly occurs.

Solution

1.     If the new primary site is displayed as failing to switch to the primary role on the RDRS page, first make sure the network between the primary and backup sites has recovered. Then, log in to the Unified Platform page of the new primary site, navigate to the System > Emergent Recovery > RDRS page, and click the  icon for an RDRS component. Wait for the switchover to succeed.

2.     If the new primary site is displayed as in primary state on the RDRS page, first make sure the network between the primary and backup sites has recovered. Then, execute the sh  /opt/matrix/app/install/metadata/UCENTER/general/rdr/scripts/forceDropProduct.sh command on any node at the backup site. If the script execution log prompt a message as shown in the following figure, the repair succeeds.

 

The primary site cannot automatically switch to the backup role though the backup site successfully takes over in automatic switchover mode

Symptom

In auto switchover mode with arbitration, the primary site is abnormal, and the backup site automatically takes over. After the original primary site recovers from the anomaly, the current primary site cannot automatically switch to the backup role. As a result, two primary sites exist.

Solution

To resolve the issue:

1.     First, make sure the original primary site has restored to normal. Then, log in to the Unified Platform page of the new primary site, navigate to the System > Emergent Recovery > RDRS page, and change the switchover mode to manual. Click the  icon for an RDRS component. Wait for the switchover to succeed.

2.     If the issue persists, contact H3C Support.

Components in primary state on both the primary and backup sites and service exceptions

Symptom

All components on the primary and backup sites are in primary state. Log in to the Unified Platform page of the primary and backup sites. The controller menus on both sites are displayed normally.

Solution

·     Possible reasons include:

In manual switchover mode, when the network between the primary and backup sites is disrupted, specify the primary role on the backup site page. The backup component takes over successfully. The original primary site does not receive the switchover to backup request due to network issues. As a result, the original primary site fails to switch to the backup role and ultimately two primary sites exist in the system.

·     To resolve the issue:

First, make sure the original primary site has restored to normal. Then, log in to the Unified Platform page of the new primary site, navigate to the System > Emergent Recovery > RDRS page, and change the switchover mode to manual. Click the  icon for an RDRS component. Wait for the switchover to succeed.

Restart of some nodes in primary/backup sites, and PXC data synchronization anomaly

Symptom

After some nodes in the primary and backup sites restart, data synchronization of the PLAT Percona XtraDB Cluster (PXC) component on the RDRS page might become abnormal.

Solution

Cause

In the PXC disaster recovery system, data is synchronized by establishing the primary/backup relationship between pxc-node1 of the primary site and pxc-node1 of the backup site. If some nodes in the primary and backup sites restart, including pxc-node1 of either site, data synchronization will be disrupted.

Solution

After you verify that the restarted node and its pxc-node have restored to normal state, identify whether the PXC data synchronization is abnormal. If yes, log in to the Unified Platform GUI of the primary site, navigate to System > Emergent Recovery > RDRS page, find the PLAT Percona XtraDB Cluster component under SYSTEM, and then click the Synchronize Data icon in the Actions column. Wait for the repair to complete.

 


Troubleshooting Kafka anomalies

Service anomalies when the Kafka instance is in Running state

Symptom

Scenario 1:

If a Kafka Pod continuously outputs a large number of error logs, execute the kubectl logs -nservice-software -f abnormal Kafka Pod name | grep "Error processing append operation on partition __consumer_offsets” command. Alternatively, access the backend node directory of the faulty Kafka Pod at /var/lib/ssdata/log/kafka/broker2 (assuming kafka-2 pod service is faulty), and locate the most recent server.log or kafka-server.log file. Identify whether the file contains a large number of error logs of Error processing append operation on partition __consumer_offsets or java.nio.BufferOverflowException. If the previous symptom persists, the Kafka Pod service is faulty.

Scenario 2:

The Kafka Pod is in normal status, but the associated services are unable to consume messages. For example, the service might report an error message that org.apache.kafka.common.errors.TimeoutException: Failed to update metadata.

Scenario 3:

The Kafka Pod is in normal status, but the part of the associated services cannot consume messages. For example, the service might report an error message that Offset commit failed on partition.

Scenario 4:

The Kafka pod is running correctly but messages of the associated business services cannot be consumed. For example, the service might report an error message that "org.apache.kafka.common.errors.TimeoutException: Failed to update metadata."

Scenario 5:

If the kafka pod status is normal but messages of all associated business services cannot be consumed, access the kafka pod to view group details:

kubectl exec -it -nservice-software   kafka pod name 

-- bash -c "/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka-svc-0:9093,kafka-svc-1:9093,kafka-svc-2:9093  --group G_iMC_5 --describe"

The error message that "The coordinator is loading and hence can’t process requests." is reported.

Scenario 6:

Access kafka to view the group status:

kubectl exec -it -nservice-software   kafka pod name 

-- bash -c "/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka-svc-0:9093,kafka-svc-1:9093,kafka-svc-2:9093  --list "

The system prompts an error message that it cannot respond to requests.

If the preceding issues persist, it indicates that the current kafka pod service is abnormal.

Scenario 7:

All kafka pod instances are normal, but a node in kafka has data corruption, causing the corrupted topic to be unreadable and unwritable. The service side reports an error message like "Caused by: java.lang.IllegalStateException: Topic(s) [SA_AI_Alarm_Aggregate] is/are not present and missingTopicsFatal is true.”

Solution

Analysis

In the previous scenarios, communication issues between Kafka and ZooKeeper can result in loss of temporary nodes registered by some Kafka Server nodes in ZooKeeper or controller switchover within Kafka itself. As a result, metadata might fail to be updated to ZooKeeper or the Kafka cache in time, causing inconsistencies between the metadata in ZooKeeper and on Kafka Server nodes. Consequently, metadata might be corrupted, and all or some Kafka service Pods are unable to provide services.

Troubleshooting procedure

To restart the Kafka Pod to resolve the issue:

For scenarios 1 and 2: Directly execute the kubectl delete pod -nservice-software abnormal kafka-pod name command to resolve the issue.

For scenarios 3, 4, 5, and 6: Delete the three Kafka Pods separately (one Pod for a single-node cluster) to resolve the issue. Execute the following commands:

kubectl delete pod -nservice-software kafka-0-xxxxx

kubectl delete pod -nservice-software kafka-1-xxxxx

kubectl delete pod -nservice-software kafka-2-xxxxx

Where xxxxx represents the suffix of the Kafka instance Pod name, as shown in the following figure.

Figure 25 Executing the command to delete Kafka Pods

 

For scenario 7: You must  delete data from topics that report errors on an abnormal node and then restart the kafka pod on that node as follows:

2.     Access the node that reports errors and execute the cd /var/lib/ssdata/kafka/broker*/kafka-log-data command to access the directory of the node with corrupted files.

3.     Execute the following commands in sequence:

sudo rm -rf *

kubectl delete pod -nservice-software abnormal kafka pod name

Abnormal Kafka instance status upon system power outage or restart

Symptom

After a node system powers off or restarts, the Kafka service might have Pods that are not in Running state. Such Pods continuously restart with an increasing restart count, and the their status switches between CrashLoopBackOff and Running. The following figure shows the Kafka Pods that are running correctly

Figure 26 Kafka Pods that are running correctly

 

Solution

A power outage or system restart on the node might damage the data files of the Kafka instance running on that node. As a result, the associated Pod keeps in a continuous restart status.

You can use the kubectl logs -nservice-software -f abnormal kafka-pod name | grep "Found a corrupted index file" command to check for any corruption in Kafka data files.
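To check all Kafka Pods at one time instead of one by one, the following is a minimal sketch under the same assumptions (service-software namespace, corruption indicated by the "Found a corrupted index file" log line); only the most recent log lines are scanned:

# Scan the recent logs of each Kafka Pod for the corruption marker.
for pod in $(kubectl get pod -nservice-software -o name | grep "pod/kafka-"); do
    echo "== $pod =="
    kubectl logs -nservice-software "$pod" --tail=2000 | grep "Found a corrupted index file"
done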

1.     Execute the kubectl get pod -nservice-software -owide | grep kafka command to locate the node where the faulty Pod resides.

Figure 27 Locating the node where faulty Pod resides

 

2.     Execute the cd /var/lib/ssdata/kafka/broker2/kafka-log-data command to access the directory that contains the corrupted files on that node.

3.     Execute the following commands in sequence:

sudo rm -rf *

kubectl delete pod -nservice-software abnormal kafka-pod name

4.     Wait for the faulty Pod to return to normal status.
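To monitor the recovery without re-running the command manually, you can wrap the status check from step 1 in watch (a sketch that assumes the watch utility is available on the node):

# Refresh the Kafka Pod status every 5 seconds until the faulty Pod is Running again.
watch -n 5 "kubectl get pod -nservice-software -owide | grep kafka"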

 

 

NOTE:

In a three-instance Pod cluster environment for Kafka, file damage on one node does not affect Kafka services, but file damage on more than one node might. In a single-node instance Pod environment, normal Kafka service operation and data integrity cannot be guaranteed if files are damaged. After recovery through the previous steps, Kafka can resume normal operation, but data loss might occur.

 

Unexpected power outage resulting in Kafka data loss

Symptom

An unstable installation environment causes the server to shut down unexpectedly every night, leading to Kafka data loss and Pod anomalies.

Solution

Restart the Kafka middleware.

Figure 28 Restarting the Kafka middleware

 
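If the GUI restart shown in the figure is not available, a CLI alternative consistent with the earlier Kafka scenarios is to delete the Kafka Pods so that Kubernetes recreates them (the xxxxx suffixes represent the actual Pod name suffixes in your environment):

# Delete the three Kafka Pods (one Pod for a single-node cluster); Kubernetes recreates them automatically.
kubectl delete pod -nservice-software kafka-0-xxxxx kafka-1-xxxxx kafka-2-xxxxx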


Troubleshooting Redis

itom-central-redis cluster data synchronization anomaly

Symptom

In a cluster environment, data synchronization between the master node and the slave1 and slave2 nodes in the itom-central-redis cluster might experience significant delays. As a result, data cannot be synchronized in a timely manner. If a master/slave switchover occurs at this moment, temporary data loss might occur.

Solution

Analysis

itom-central-redis uses the Redis master/slave mode, but sometimes a significant delay exists in data synchronization between the master and slave nodes. This issue usually occurs when certain master and slave nodes restart or when the network is abnormal, causing full synchronization to be incomplete or disrupted.

Troubleshooting procedure

The itom-central-redis cluster contains three pods: itom-central-redis-master, itom-central-redis-slave-1, and itom-central-redis-slave-2. As long as the pods are in Running state, the priority order for acting as the master Redis node is: itom-central-redis-master > itom-central-redis-slave-1 > itom-central-redis-slave-2. More specifically, if itom-central-redis-master is available, it acts as the master node. If itom-central-redis-master is unavailable and itom-central-redis-slave-1 is available, itom-central-redis-slave-1 acts as the master node. If both itom-central-redis-master and itom-central-redis-slave-1 are unavailable, itom-central-redis-slave-2 acts as the master node.

Use the kubectl get pod -n service-software | grep itom-central-redis command to view the pod status. In the second column, a value of 1/1 indicates the master node and a value of 0/1 indicates a slave node. During the check, if itom-central-redis-master is the master node, check both itom-central-redis-slave-1 and itom-central-redis-slave-2. If itom-central-redis-slave-1 is the master node, check only itom-central-redis-slave-2.
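The following is an illustrative example of this check. The output is a sketch only; the actual Pod names carry instance suffixes and the ages differ in each environment:

kubectl get pod -n service-software | grep itom-central-redis
# Example output (illustrative):
# itom-central-redis-master-...    1/1   Running   0   3d    <- current master node
# itom-central-redis-slave-1-...   0/1   Running   0   3d    <- slave node
# itom-central-redis-slave-2-...   0/1   Running   0   3d    <- slave node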

1.     Identify whether the synchronization process between master and slave is still in progress. More specifically:

¡     Execute the following command on any k8s master node to identify whether the master_sync_in_progress status is 1. If the status is 1, synchronization is in progress and no additional processing is needed. You can execute the command multiple times until the master_sync_in_progress status changes to 0, which indicates that data synchronization is complete. (If you are using version E072x, replace redis-cli with mkvdb-cli in the command.) An illustrative output excerpt is shown after this list.

kubectl exec -it -n service-software $(kubectl get pod -n service-software |grep itom-central-redis-slave-1|awk '{print $1}') -- redis-cli -p 30011 info replication

¡     To check the synchronization delay between itom-central-redis-slave-2 and the master node, execute the following command and interpret the result in the same way as in the preceding step. (If you are using version E072x, replace redis-cli with mkvdb-cli in the command.)

kubectl exec -it -n service-software $(kubectl get pod -n service-software |grep itom-central-redis-slave-2|awk '{print $1}') -- redis-cli -p 30011 info replication
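The fields checked in this procedure come from the info replication output of the slave Pod. The following excerpt is illustrative only, with made-up values; the field names are standard Redis replication fields:

# Illustrative excerpt of "info replication" output on a slave node (values are made up):
# role:slave
# master_link_status:up
# master_last_io_seconds_ago:1
# master_sync_in_progress:0
# slave_repl_offset:102400
# master_repl_offset:102400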

2.     After the preceding checks are completed, the delay between the master and slave is increasing if all of the following conditions exist:

¡     The master_sync_in_progress status is 0.

¡     The slave_repl_offset is smaller than the master_repl_offset, and the difference is gradually increasing.

¡     The master_last_io_seconds_ago value also increases.

In this case, you must restart the corresponding slave node pod. For example, if you discover a synchronization delay between itom-central-redis-slave-2 and the master node, execute the following command to restart the pod, resynchronize the data, and attempt to resolve this issue.

kubectl delete pod -n service-software $(kubectl get pod -n service-software |grep itom-central-redis-slave-2|awk '{print $1}')
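After the Pod is recreated, you can re-run the check from step 1 to confirm that replication has recovered; watch for master_link_status being up and the slave_repl_offset catching up with master_repl_offset. (As above, replace redis-cli with mkvdb-cli if you are using version E072x.)

kubectl exec -it -n service-software $(kubectl get pod -n service-software |grep itom-central-redis-slave-2|awk '{print $1}') -- redis-cli -p 30011 info replication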

 
