H3C Unified Platform Troubleshooting Guide-E07xx-5W103


 

H3C Unified Platform

Troubleshooting Guide

Document version: 5W103-20250811

 

Copyright © 2025 New H3C Technologies Co., Ltd. All rights reserved.

No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of New H3C Technologies Co., Ltd.

Except for the trademarks of New H3C Technologies Co., Ltd., any trademarks that may be mentioned in this document are the property of their respective owners.

The information in this document is subject to change without notice.


Contents

Introduction
    General guidelines
    Collecting failure running information
        Collecting running logs
        Collecting deployment/upgrade logs
        Collecting runtime data
    Contacting technical support
Troubleshooting cluster nodes
    Node hardware failure
    Unified Platform unavailable because of cluster node exceptions
    Containers get stuck in evicted state because of insufficient disk space
    K8s node gets stuck in Not Ready state after a hugepages size change
    Cluster network mode change failure
    Exception on kube-apiserver, kube-scheduler, or kube-controller-manager after Matrix upgrade
    Error occurrence on Pods including calico-node with error message Delegation not available for unit type
    The system node experiences prolonged power failure or exceptions, causing PostgreSQL data directories on other nodes to consume excessive disk space
    Some nodes become NotReady and many Pods become abnormal upon network restoration for the primary site in an RDRS environment
Troubleshooting blocked access to Matrix services
    The cluster rejects access to any Matrix service if the security policy drops all access requests
Troubleshooting Matrix login failure because of incorrect password
    Login failure occurs because of incorrect password
    Other users except admin fail to log in to Matrix when they enter incorrect passwords
Troubleshooting default route loss
    Default routes get lost after the ifconfig command restarts the NIC
Troubleshooting ETCD service failure
    ETCD service startup failure
    ETCD client request times out or primary/standby switchover occurs frequently in the ETCD cluster when ETCD does not have an exclusive use of a disk
Troubleshooting docker service failure
    No response after executing docker commands
    The page displays that the Docker component is abnormal
Troubleshooting failures caused by power cycle of the server or network disconnection
    Files in the operating system were lost after power cycle of the server
    Files that Matrix depends on were lost after power cycle of the server
    The node is in red or yellow or some Pods are in CreateContainerError state on the GUI of the Matrix platform after power cycle of the server
    The node is in red or yellow or some Pods are in Error state on the GUI of the Matrix platform after power cycle of the server
    After you power off and restart the server where the node resides, the Prometheus data file becomes corrupted, causing abnormal Pod status
    Power cycle of the node server results in unstable IPv6 NIC for the MACVLAN additional network
    Power cycle of the node server results in repeated restarts of the Pod that uses the additional network
Troubleshooting deployment failure
    Cluster deployment failed and a K8SINSTALL-ERROR occurred
    Unified Platform deployment failed because the kubectl exec command failed
    Cluster deployment failed in an IPv6 environment when a vNIC was added to a node or an IP address was added to a vNIC
Troubleshooting Unified Platform access failure
    Slow request processing caused by ETCD I/O latency
    Unified Platform login failure due to password loss
    Login anomalies caused by Ext4 file system damage
Troubleshooting GlusterFS anomalies
    GlusterFS deployment fails because the disk or disk partition it uses has residual data
    Failed to install components that use GlusterFS storage because storage volumes cannot be deleted
    Failed to upgrade components that use GlusterFS storage because glusterd exits
    GlusterFS service anomaly occurred after Matrix rebuilding by using the ISO image
    Unified Platform upgrade/uninstallation or remote disaster recovery failure caused by deadlocks in GlusterFS
    After you restart a node or the network changes, the mount directories of some pods that use the GlusterFS storage might have question marks (???) in the file content and the file content cannot be read or written
    Unified Platform upgrade/uninstallation or remote disaster recovery failure caused by deadlock in GlusterFS
    In the GlusterFS disaster recovery environment, the primary and backup clusters experience data synchronization issues due to the absence of scheduled tasks
Troubleshooting IP editing failure
    Failed to edit the VIPs
Troubleshooting image corruption
    Image corruption
    Image layer verification failed
Troubleshooting PXC failures
    Failed to start the PXC database
    Disk file corruption of PXC database
    PXC database in abnormal state
    Loss of all contents of PXC database startup file grastate.dat
    Power outage resulting in loss of all disk file content required to start the PXC database
Troubleshooting server restart issues
    Power outage resulting in ZooKeeper data sync failure between three nodes
    Server power outage resulting in damage of the XFS file and Vertica database
    Kafka startup failure due to ZooKeeper data sync issue upon restart
    Forcible power shutdown resulting in damage to the operating system's XFS partition upon server restart and system anomalies
Troubleshooting Unified Platform upgrade failures
    After you upgrade Unified Platform, the k-eureka pod and others are abnormal and in Pending state
Troubleshooting RDRS
    If you delete the RDRS on the primary site after the primary and backup sites are disconnected, residual configurations exist on the backup site
    During the process of primary/backup switchover, a network issue between the primary and backup sites caused the new backup site to become inaccessible
    The primary site cannot automatically switch to the backup role though the backup site successfully takes over in automatic switchover mode
    Components in primary state on both the primary and backup sites and service exceptions
    Restart of some nodes in primary/backup sites, and PXC data synchronization anomaly
Troubleshooting Kafka anomalies
    Service anomalies when the Kafka instance is in Running state
    Abnormal Kafka instance status upon system power outage or restart
    Unexpected power outage resulting in Kafka data loss
Troubleshooting Redis
    itom-central-redis cluster data synchronization anomaly


Introduction

This document provides information about troubleshooting common software and hardware issues with Unified Platform.

General guidelines

To help identify the cause of issues, collect system and configuration information, including:

·     Unified Platform version, Matrix version, and operating system version.

·     Symptom, time of failure, and configuration.

·     Network topology information, including network diagram, port connections, and points of failure.

·     Log messages and diagnostic information. For more information about collecting this information, see "Collecting failure running information."

·     Steps you have taken and the effect.

Collecting failure running information

Perform the following tasks to collect the running information of Unified Platform.

Collecting running logs

1.     Enter the GUI login address of Unified Platform (in the format of http://ip_address:30000/central/index.html) in your browser, and then press Enter.

2.     Enter the username and password, and then click Login to enter the GUI home page of Unified Platform.

3.     Click System > Log Management > Running Logs to enter the Running Logs page. Then, select Global Logs or a node-specific log and perform the following tasks:

¡     To view logs in the specific directory in a time span, specify the directory and the start time and end time of the time span.

¡     Search for logs of a specific module by using the file or directory name. For example:

-     To search for alarm logs, enter keyword itom-alarm.

-     To search for health check logs, enter keyword occ.

-     To search for backup and restoration logs, enter keyword backup_recovery.

-     To search for dashboard logs, enter keyword dashboard.

-     To search for resource permission logs, enter keyword k-ures, k-permission, or k-framework.

¡     To export running logs to the local device, select specific or all log files, and then click Export.

Figure 1 Running Logs

 

Collecting deployment/upgrade logs

Exporting logs on the GUI

1.     Enter the Matrix login URL (in the format of https://ip_address:8443/matrix/ui) in the address bar of the browser. Enter the username and password, and click Login.

2.     Click the ... icon in the upper right corner of the page, and then click Export Log. In the confirmation dialog box that opens, click OK.

Figure 2 Clicking Export Log

 

Figure 3 Clicking OK

 

3.     After you export the logs, you can view the exported logs on the download page of the browser.

Exporting logs from the backend

If the exported log file is too large, you can export the deployment/upgrade logs from the backend instead. To do so, use a file transfer tool such as WinSCP or MobaXterm to download the log file matrix.log from the /var/log/matrix-diag/Matrix/Matrix/ directory on each of the three nodes to your local host.
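If you prefer the command line to a GUI file transfer tool, the following sketch shows the same download with scp. It assumes SSH access from a Linux host and uses a placeholder node IP address that you must replace with each node's actual address.

# Run from your local host once per node (replace <node-ip>)
scp root@<node-ip>:/var/log/matrix-diag/Matrix/Matrix/matrix.log ./matrix-<node-ip>.log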

Collecting runtime data

Use the backend script /opt/matrix/tools/matrix_log_collection.sh to collect real-time operational data. When network or traffic-related issues occur, you must execute this script on all nodes to collect data. Because log collection will occupy some disk space, make sure enough free space is available on your disk before you execute the script.

1.     Log in to the backend of each node and execute the sudo bash /opt/matrix/tools/matrix_log_collection.sh  command. During the script execution process, you need to enter Y multiple times to confirm the operation.

Figure 4 Executing the script

 

2.     After you execute the script, a compressed file named matrix-timestamp.tar.gz will be created in the /home/matrix-log-collect directory. Export this file for troubleshooting.

Figure 5 Checking the /home/matrix-log-collect directory

 

Contacting technical support

If an issue persists after you perform the troubleshooting procedures in this document, contact H3C Support. When you contact an authorized H3C support representative, be prepared to provide information described in "General guidelines." This information will help the support engineer assist you as quickly as possible.

You can contact H3C Support at [email protected].

 


Troubleshooting cluster nodes

Node hardware failure

Symptom

A node in the cluster fails and needs to be replaced.

Solution

If a node in the cluster fails because of a hardware error and cannot be recovered, replace it with a new server.

1.     Configure the replacement node server to have the exact same hostname, network card name, node IP address, username, password, RAID mode, and disk partition setting as the original faulty node.

2.     Install on the replacement node server the same version of Matrix software as the cluster nodes. For more information, see H3C Unified Platform Deployment Guide.

3.     Log in to the Matrix platform. Click Deploy on the top navigation bar and then select Cluster from the navigation pane. Then, click the set icon  for the target node, and select Repair to repair the node.

4.     Verify that the target node is normal after it is rebuilt.

5.     If the issue persists, contact the support.

Unified Platform unavailable because of cluster node exceptions

Symptom

The following issues are all present:

·     Failed to log in to Unified Platform.

·     On the Matrix platform, a master node is in red and you cannot ping the abnormal node.

·     Pods in running state reside on the abnormal node.

·     Execute the kubectl get endpoints -nservice-software itom-central-login-svc command to identify the endpoints for the itom-central-login service. If the Pod IP addresses on the abnormal node remain on these endpoints, an exception occurs.

Figure 6 Identifying endpoints for the itom-central-login service

 

Solution

1.     Access the CLI of the abnormal node and execute the kubectl drain nodeName --ignore-daemonsets --force --delete-local-data --timeout=1800s command to remove all Pods from the abnormal node.

The nodeName argument represents the name of the abnormal node, which can be obtained by using the hostname command.

2.     Execute the kubectl delete node nodeName command to remove the abnormal node.

3.     Repair the nodes that are abnormally disconnected. If the server hardware failure cannot be recovered, replace the server.

4.     Log in to the Matrix platform. Click Deploy on the top navigation bar and then select Cluster from the navigation pane. Then, click the set icon  for the target node, and select Rebuild to rebuild the node.

5.     If the issue persists, contact the support.

Containers get stuck in evicted state because of insufficient disk space

Symptom

When the disk space on a node is full, the kubectl get pods --all-namespaces command shows that a large number of containers on the node are in evicted state. These containers stay in that state even if disk space is manually released.

Solution

This symptom might occur if the disk space is insufficient on a node. In this case, the K8s mechanism generates a large number of containers in evicted state.

To resolve the issue:

1.     Release disk space in the root partition to reduce disk usage.

For example, manually delete the compressed log packages from the /var/log directory and old installation packages from the /opt/matrix/app/install/packages/ directory. For a command sketch, see the example after this procedure.

2.     Log in to the Matrix platform. Click Deploy on the top navigation bar and then select Cluster from the navigation pane. Then, click the set icon  for the target node, and select Repair to repair the node. The K8s mechanism will automatically delete containers in evicted state.

3.     If the issue persists, contact the support.
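The following commands illustrate step 1. They assume a root shell on the affected node, and the file patterns are examples only, so review each file before deleting it.

# Check root partition usage
[root@node1 ~]# df -h /
# List and remove compressed log packages (example pattern)
[root@node1 ~]# ls /var/log/*.tar.gz
[root@node1 ~]# rm -f /var/log/<old-log-package>.tar.gz
# Remove old installation packages that are no longer needed
[root@node1 ~]# ls /opt/matrix/app/install/packages/
[root@node1 ~]# rm -f /opt/matrix/app/install/packages/<old-package-file>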

K8s node gets stuck in Not Ready state after a hugepages size change

Symptom

After you change the hugepages size, for example, changing the value of the GRUB_CMDLINE_LINUX parameter in file /etc/default/grub from "crashkernel=auto rhgb quiet default_hugepagesz=2M hugepagesz=2M hugepages=8192" to "crashkernel=auto rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=16", the K8s node gets stuck in Not Ready state even after the system restarts.

Solution

Kubelet does not support huge pages of different sizes. When you change the size of huge pages, a new hugepages size directory is created in /sys/kernel/mm/hugepages/, and the previous hugepages size directory is still retained. This can cause K8s node state synchronization to fail and therefore cause the K8s node to get stuck in Not Ready state.
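To confirm that two hugepages size directories coexist on the node, you can list the directory from the CLI. The directory names below are the standard kernel names for 2 MB and 1 GB pages; the actual output depends on your system.

[root@node1 ~]# ls /sys/kernel/mm/hugepages/
hugepages-1048576kB  hugepages-2048kB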

This issue can be resolved by setting the number of huge pages of one size to 0. This section sets the number of 2M huge pages to 0.

To resolve the issue:

1.     Edit the hugepages configuration file.

a.     Use a vi editor to open the configuration file.

[root@node1 ~]# vi /etc/default/grub

b.     Press i to enter edit mode and configure settings in the file as follows:

GRUB_TIMEOUT=5

GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"

GRUB_DEFAULT=saved

GRUB_DISABLE_SUBMENU=true

GRUB_TERMINAL_OUTPUT="console"

GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=0"

GRUB_DISABLE_RECOVERY="true"

c.     Press ESC to exit edit mode.

d.     Enter :wq and press Enter to save the changes.

2.     Save configuration changes and restart the server for the changes to take effect.

¡     If the server started up in UEFI mode, perform the following tasks:

[root@node1 ~]# grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg

Generating grub configuration file ...

Found linux image: /boot/vmlinuz-3.10.0-862.el7.x86_64

Found initrd image: /boot/initramfs-3.10.0-862.el7.x86_64.img

Found linux image: /boot/vmlinuz-0-rescue-f2e062c5077847ae837b2f1cdb91104f

Found initrd image: /boot/initramfs-0-rescue-f2e062c5077847ae837b2f1cdb91104f.img

Done

[root@node1 ~]# reboot

¡     If the server started up in Legacy mode, perform the following tasks:

[root@node1 ~]# grub2-mkconfig -o /boot/grub2/grub.cfg

Generating grub configuration file ...

Found linux image: /boot/vmlinuz-3.10.0-862.el7.x86_64

Found initrd image: /boot/initramfs-3.10.0-862.el7.x86_64.img

Found linux image: /boot/vmlinuz-0-rescue-f2e062c5077847ae837b2f1cdb91104f

Found initrd image: /boot/initramfs-0-rescue-f2e062c5077847ae837b2f1cdb91104f.img

Done

[root@node1 ~]# reboot

3.     Verify that the number of 2M huge pages has been changed to 0.

[root@node1 ~]# cat /proc/cmdline

BOOT_IMAGE=/vmlinuz-3.10.0-862.el7.x86_64 root=UUID=f47e3128-e888-499e-b370-2b381b6f3134 ro crashkernel=auto rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=0

4.     If the issue persists, contact the support.

Cluster network mode change failure

Symptom

The network mode change for the cluster fails from the Matrix Web interface because of ETCD service exceptions on the master node.

Solution

The system initiates two ETCD requests for a network mode change, one for the change in calico and the other for the change on the Web interface. The Web interface prompts mode change failure if either request fails.

To resolve the issue:

1.     Verify that the ETCD service has recovered. If the service has not recovered, contact the support.

[root@name1 1.0.0]# export ETCDCTL_API=2 && etcdctl cluster-health

member fb58b3b32bac01c is healthy: got healthy result from http://matrix-node1:2379

member aa6e53b313aa741f is healthy: got healthy result from http://matrix-node2:2379

member d1fcbe1f6db25390 is healthy: got healthy result from http://matrix-node3:2379

2.     If the network mode on the Web interface has been changed but the system prompts failure, change the network mode back to the previous mode. Make sure the mode has been changed back on the Web interface and then proceed to the next step.

3.     Try again to set the target network mode.

Exception on kube-apiserver, kube-scheduler, or kube-controller-manager after Matrix upgrade

Symptom

A node is in red after Matrix upgrade and the node details indicate that a kube-apiserver, kubeScheduler, or kubeControllerManager exception has occurred. After you access the CLI of the abnormal node and execute the kubectl get pod -A -owide command, the command output shows that the abnormal node has Pods in CrashLoopBackOff state.

Solution

The solution for the failure varies by situation.

Situation 1

·     Symptom:

When you execute netstat -anlp | grep -w 6443, netstat -anlp | grep -w 10251, or netstat -anlp | grep -w 10252 on the node where the abnormal Pods reside, the corresponding service ports are occupied and in LISTEN state.

·     Cause analysis:

If the old processes do not exit after Matrix upgrade and kube-apiserver port 6443, kube-scheduler port 10251, or kube-controller-manager port 10252 is not released, new Pods cannot start up. You can access the CLI of the node and execute kubectl logs -n kube-system $pod_name or docker logs $container_id to obtain log information about the occupied ports.

·     Solution:

The troubleshooting procedures are similar for the kube-scheduler, kube-apiserver, and kube-controller-manager services. This example troubleshoots the kube-scheduler service as follows:

a.     Remove the kube-scheduler Pod.

[root@name ~]# mv /etc/kubernetes/manifests/kube-scheduler.yaml /opt/

b.     Verify that all kube-scheduler containers have exited. If a container does not exit after a long while, execute the docker rm -f $container_id command to forcibly remove the container or execute the systemctl restart docker command to restart the docker service.

[root@name ~]# docker ps | grep kube-scheduler

c.     Execute the netstat -anlp | grep -w 10251 command to verify that the port has been released. The port is released if no connection in LISTEN state is displayed in the command output.

d.     Start up the kube-scheduler Pod.

[root@name ~]# mv /opt/kube-scheduler.yaml /etc/kubernetes/manifests/

e.     Execute the kubectl get pod -n kube-system -o wide command to obtain the Pod status.

f.     If the issue persists, contact the support.

Situation 2

·     Symptom:

When you execute netstat -anlp | grep -w 6443, netstat -anlp | grep -w 10251, or netstat -anlp | grep -w 10252 on the node where the abnormal Pods reside, the corresponding ports are occupied and only connections in TIME_WAIT state exist. In addition, the ports are not occupied by the kube-apiserver, kube-scheduler, or kube-controller-manager process.

·     Cause analysis:

If the kube-apiserver, kube-scheduler, or kube-controller-manager Pod restarts during Matrix upgrade, port 6443, 10251, or 10252 might be occupied by GlusterFS. In this case, a Pod exception occurs.

·     Solution:

Contact the support.

Error occurrence on Pods including calico-node with error message Delegation not available for unit type

Symptom

After you edit the IP address of a node on the Matrix platform, the node is in red. After you access the CLI of the abnormal node and execute the kubectl get pod -A -owide command, the command output shows that some Pods, including calico-node and calico-kube-controller, are abnormal.

The kubelet logs are as follows:

Error syncing pod 991e112f-c3a3-4c46-9a9b-dfde4ca0a27b ("calico-node-vlpz8_kube-system(991e112f-c3a3-4c46-9a9b-dfde4ca0a27b)"), skipping: failed to ensure that the pod: 991e112f-c3a3-4c46-9a9b-dfde4ca0a27b cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod991e112f-c3a3-4c46-9a9b-dfde4ca0a27b] : Delegation not available for unit type

Solution

This symptom might occur if the containerd version is lower than containerd-v1.3.0. To obtain the containerd version, execute the containerd -v command from the CLI of the node.

To resolve the issue, execute the systemctl restart kubelet.service command on the node where an abnormal Pod resides to restart the kubelet service.
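A minimal command sketch for this check and fix, run on the node that hosts the abnormal Pod, is as follows:

# Check the containerd version (the issue applies to versions earlier than v1.3.0)
[root@node1 ~]# containerd -v
# Restart the kubelet service
[root@node1 ~]# systemctl restart kubelet.service
# Verify that the calico Pods recover
[root@node1 ~]# kubectl get pod -A -owide | grep calico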

The system node experiences prolonged power failure or exceptions, causing PostgreSQL data directories on other nodes to consume excessive disk space

Solution

Cause

In a PostgreSQL database cluster, the standby databases continuously synchronize data from the primary database. The synchronized data depends on the WAL logs in the primary database. To ensure proper data synchronization for all standby database Pods, the primary database reserves the WAL logs not yet synchronized by standby databases even though it has automatic WAL log cleanup enabled. If a standby database node remains inactive for a long time while the PostgreSQL database keeps performing insert, delete, or update operations, the WAL log directory on the primary node will keep consuming more disk space. As shown in the following figure, the WAL log directory size increased from 97 MB to 11 GB.

Solution

The primary PostgreSQL database reserves WAL logs to ensure that standby databases can synchronize data and operate normally. To resolve disk space issues, start the stopped node and make sure its PostgreSQL instance Pod runs properly. As the standby Pod continuously synchronizes data, the primary database automatically clears WAL logs, gradually freeing up disk space. As shown in the following figure, the disk space occupied by WAL logs gradually decreases from 11 GB to 657 MB after the down node restarts.
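To observe the recovery, you can watch the Pod status and the size of the WAL log directory on the primary node. The sketch below is only an illustration: the Pod name filter and the data directory path are placeholders that vary by deployment, so adjust them to your environment.

# Verify that the PostgreSQL instance Pods on the recovered node are Running (name filter is a placeholder)
[root@node1 ~]# kubectl get pod -A -owide | grep -i postgres
# Watch the WAL log directory size decrease on the primary node (path is a placeholder)
[root@node1 ~]# du -sh <postgresql-data-directory>/pg_wal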

 

Some nodes become NotReady and many Pods become abnormal upon network restoration for the primary site in an RDRS environment

Symptom

If you disconnect the network for the primary site of the RDRS environment, wait for a period of time, and then restore the network connection, some nodes at the primary site remain in NotReady state and a large number of Pods become abnormal. The issue does not recover on its own.

Solution

The Docker log of the abnormal nodes contains a large number of lock requests and waits. After network restoration, process and Pod interlocks occur during the terminating and starting of Pods by Docker at the primary site, resulting in abnormal node state. This issue occurs with an extremely low probability.

To resolve the issue, execute the systemctl restart docker.service command at the backend of abnormal nodes to restart the Docker service.
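A minimal sketch of the recovery on each abnormal node follows. The journalctl filter for inspecting the Docker log is illustrative only and assumes Docker logs to the system journal.

# Optional: inspect the Docker log for lock requests and waits (filter is illustrative)
[root@node1 ~]# journalctl -u docker --since "1 hour ago" | grep -i lock | tail
# Restart the Docker service
[root@node1 ~]# systemctl restart docker.service
# Verify that the nodes return to Ready state
[root@node1 ~]# kubectl get nodes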

 


Troubleshooting blocked access to Matrix services

The cluster rejects access to any Matrix service if the security policy drops all access requests

Symptom

If you enable a security policy with the default action set to DROP and all rules removed, the cluster rejects all access requests to any Matrix services.

Solution

This symptom occurs because the default action is set to DROP and the default rule that permits access to Matrix port 8443 is removed.

To resolve the issue:

1.     Log in to a master node from the management port.

2.     Access the disaster recovery directory.

[root@node1 ~]# cd /opt/matrix/k8s/disaster-recovery/

3.     Restore the default security policy settings.

[root@node1 ~]# bash recover-security-policies.sh

4.     If the issue persists, contact the support.
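After you restore the default security policy, you can verify from any host that Matrix port 8443 accepts requests again. The check below assumes curl is available and uses a placeholder node IP address.

# An HTTP status code (for example, 200 or 302) indicates that access to port 8443 is no longer dropped
[root@node1 ~]# curl -k -s -o /dev/null -w "%{http_code}\n" https://<node-ip>:8443/matrix/ui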

 


Troubleshooting Matrix login failure because of incorrect password

Login failure occurs because of incorrect password

Symptom

An admin user fails to log in to Matrix because of incorrect password.

Solution

To resolve the issue:

1.     Access the script directory on a master node and execute the bash script_name reset_password command to run the password reset script. The script_name argument represents the script name and the reset_password argument represents the new password. The script name varies by matrix operating status. In this example, the new password is Pwd@123456.

¡     If the matrix operates correctly, run the password reset script named resetMatrixUserPassword.sh.

[root@node1 ~]# cd /opt/matrix/k8s

[root@node1 k8s]# bash resetMatrixUserPassword.sh Pwd@123456

+ MATRIX_ADMIN_PASSWD=Pwd@123456

++ curl -k -g -X POST -H Content-Type:application/json -d '{"password": "Pwd@123456"}' https://localhost:8443/matrix/rsapi/v1.0/usermanage/reset_password

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current

                                 Dload  Upload   Total   Spent    Left  Speed

100   110    0    84  100    26    251     77 --:--:-- --:--:-- --:--:--   252

+ return_info='{"token":"3ac4fd9b-35d7-4f66-97b0-2b4ef0a368d1","username":"admin","expireTime":600}'

+ [[ {"token":"3ac4fd9b-35d7-4f66-97b0-2b4ef0a368d1","username":"admin","expireTime":600} =~ admin ]]

+ echo 'Password reset succeeded.'

Password reset succeeded.

¡     If the matrix cannot operate correctly, run the emergency password reset script named resetMatrixUserPassword_emergency.sh.

[root@node1 ~]# cd /opt/matrix/k8s

[root@node1 k8s]# bash resetMatrixUserPassword_emergency.sh Pwd@123456

+ MATRIX_ADMIN_PASSWD=Pwd@123456

++ curl -k -g -X POST -H Content-Type:application/json -H X-Access-Mode:emergency -d '{"password": "Pwd@123456"}' https://localhost:8443/matrix/rsapi/v1.0/usermanage/reset_password

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current

                                 Dload  Upload   Total   Spent    Left  Speed

100   110    0    84  100    26    253     78 --:--:-- --:--:-- --:--:--   253

+ return_info='{"token":"d90753f5-cd2c-4c1c-b178-45cdb18c6261","username":"admin","expireTime":600}'

+ [[ {"token":"d90753f5-cd2c-4c1c-b178-45cdb18c6261","username":"admin","expireTime":600} =~ admin ]]

+ echo 'Password reset succeeded.'

Password reset succeeded.

2.     Log in to Matrix again by using the new password.

3.     If the issue persists, contact the support.

Other users except admin fail to log in to Matrix when they enter incorrect passwords

Symptom

On the Matrix login page, the non-admin users will fail to log in due to forgotten passwords or other reasons.

Solution

Execute the corresponding script based on your cluster conditions to reset the password.

·     Reset the password when the cluster runs normally

a.     Access the script directory of a master node. Execute the script by using the bash resetMatrixUserPassword.sh username  reset_password command, where resetMatrixUserPassword.sh is the script name, username is the username, and reset_password is the new password. For example: bash resetMatrixUserPassword.sh test Pwd@123456.

[root@node1 ~]# cd /opt/matrix/k8s

[root@name0 k8s]# bash resetMatrixUserPassword.sh test Pwd@12345

Password reset to Pwd@12345 for user test succeeded.

b.     After the script is executed, log in to Matrix with the new password.

·     Reset the password in cluster emergency mode

a.     First, reset the admin password and repair the cluster by following the steps in "Login failure occurs because of incorrect password."

b.     After the cluster recovers, reset the user password by using the method for a normally running cluster described earlier in this section.

 


Troubleshooting default route loss

Default routes get lost after the ifconfig command restarts the NIC

Symptom

The configured default routes get lost after you execute the ifconfig command on a node in the cluster to restart the NIC of the node.

Solution

To resolve the issue:

1.     Access the CLI of the node and then use the systemctl restart network command to restart the network service.

[root@node01 ~]# systemctl restart network

2.     Use the route -n command to verify that the default routes have been restored. The example output from this command is as follows:

[root@node01 ~]# route -n

Kernel IP routing table

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface

0.0.0.0         10.99.212.1     0.0.0.0         UG    0      0        0 eth0

10.99.212.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0

169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0

192.168.122.0   0.0.0.0         255.255.255.0   U     0      0        0 virbr0

3.     If the issue persists, contact the support.

 


Troubleshooting ETCD service failure

ETCD service startup failure

Symptom

ETCD service startup failure occurs because of ETCD storage data file corruption or file loss when a node is power cycled.

·     After a node is power cycled, the ETCD service on it fails to start up because of database file corruption. This event eventually causes the cluster to malfunction.

The following information is displayed in the etcd.log file in the /var/log/matrix-diag/Matrix/etcd/ directory:

panic: freepages: failed to get all reachable pages (page 1407374894039040: out of bounds: 1264)

goroutine 116 [running]:

panic(0x55a1d6cce4a0, 0xc420202ef0)

        /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/panic.go:551 +0x3c5 fp=0xc42006bf60 sp=0xc42006bec0 pc=0x55a1d5f0ae25

github.com/coreos/bbolt.(*DB).freepages.func2(0xc42020c180)

...

·     Data restoration failure causes file corruption.

Typically, the snap log file in the /var/lib/etcd/default.etcd/member/ directory must contain at least one log index value greater than the smallest log index value in the wal log file.

If all log index values in the wal log file are greater than those in the snap log file (as shown in Figure 7), the key operation log data will get lost and the ETCD service cannot recover data. In this case, file corruption occurs.

Figure 7 Viewing log index values in snap and wal log files

 

·     After a node is power cycled, the ETCD service on it fails to start up because of database snapshot file loss. This event eventually causes the cluster to malfunction.

The following log is displayed:

etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist

·     After a node is power cycled, the ETCD service on it fails to start up because of data file corruption. This event eventually causes abnormal node status.

The following log is displayed on the abnormal node:

"error":"walpb: crc mismatch"

Solution

To resolve the issue:

1.     Log in to each node, and execute the systemctl status etcd command to verify that the ETCD service is in running state.

2.     If the database file for the ETCD service of only one node is corrupted in standalone mode, use the procedure in "Solution for a standalone node."

3.     If the database file for the ETCD service of only one node is corrupted in cluster mode, perform the following steps:

a.     Log in to Matrix.

b.     Click the  icon in the upper right corner of the target node on the Deploy > Clusters > Deploy Cluster page, and then select Rebuild to rebuild the node.

4.     If the database file for the ETCD service is corrupted on two nodes in cluster mode, Matrix enters emergency mode. You can resolve the issue by rebuilding the nodes one after another.

5.     If the database file for the ETCD service is corrupted on all three nodes in cluster mode, use the procedure in "Solution for a three-node cluster."

Solution for a standalone node

1.     Log in to the node. Then, execute the systemctl status etcd command to view the ETCD service status. If the ETCD status is not running, go to the next step.

[root@master1 ~]# systemctl status etcd

2.     Stop the Matrix service on the node.

¡     For a root account, execute the systemctl stop matrix command.

[root@master1 ~]# systemctl stop matrix

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl stop matrix" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop matrix"

3.     Execute the mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix command to stop the kube-apiserver service.

[root@master1 ~]# mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix

4.     Completely stop the ETCD service and remove the ETCD data directory.

¡     For a root account, execute the systemctl stop etcd command to completely stop the ETCD service and then execute the rm -rf /var/lib/etcd/default.etcd/ command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.

[root@master1 ~]# systemctl stop etcd

[root@master1 ~]# rm -rf /var/lib/etcd/default.etcd/

[root@master1 ~]# ll /var/lib/etcd/

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl stop etcd" command to completely stop the ETCD service and then execute the sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/" command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop etcd"

[admin@node4 ~]$ sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/"

[admin@node4 ~]$ ll /var/lib/etcd/

5.     Access the ETCD backup directory /opt/matrix/backup/etcd_backup_snapshot/ to find the latest backup data file, such as Etcd_Snapshot_V900R001B06D012_20210805091547.db.

[root@master1 ~]# ll /opt/matrix/backup/etcd_backup_snapshot/

6.     Navigate to the ETCD restoration script directory and execute the restoration operation. The Etcd_Snapshot_*_*.db file specified in the restoration script is the latest backup data file found in step 5.

7.     Access the directory where the ETCD service restoration script resides.

[root@master1 ~]# cd /opt/matrix/k8s/disaster-recovery/

8.     Execute the ETCD service restoration script with a new backup file from the specified ETCD backup directory /opt/matrix/backup/etcd_backup_snapshot/. If the backup file does not exist in the backup directory, an error will be reported.

¡     For a root account, execute the following commands:

[root@master1 ~]# bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805091547.db

2021-08-06 03:16:19.500144 I | mvcc: restore compact to 109069

2021-08-06 03:16:19.506086 I | etcdserver/membership: added member 91651d28c8465c86 [http://10.99.212.125:2380] to cluster db6c09f0e7b9702b

¡     For a non-root account, execute the following commands:

[admin@node4 ~]$ sudo bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805091547.db

2021-08-06 03:16:19.500144 I | mvcc: restore compact to 109069

2021-08-06 03:16:19.506086 I | etcdserver/membership: added member 91651d28c8465c86 [http://10.99.212.125:2380] to cluster db6c09f0e7b9702b

9.     Restart the ETCD service.

¡     For a root account, execute the systemctl restart etcd command.

[root@master1 ~]# systemctl restart etcd

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl restart etcd" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart etcd"

10.     Restart the Matrix service.

¡     For a root account, execute the systemctl restart matrix command.

[root@master1 ~]# systemctl restart matrix

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl restart matrix" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart matrix"

11.     Restore the kube-apiserver service.

[root@master1 ~]# mv /opt/matrix/kube-apiserver.yaml /etc/kubernetes/manifests/

12.     Verify that the issue has been resolved.

a.     Use the VIP to log in to the GUI of the Matrix platform.

b.     On the top navigation bar, click Deploy, and then select Cluster from the left navigation pane to view the master node status. The master node is in normal status, as shown in Figure 8.

Figure 8 One master node in normal status

 

13.     Click the OBSERVE tab, and then select Workload from the menu to view the running status of the Pods. All Pods are in running status, as shown in Figure 9.

Figure 9 All pods on the Pod tab in running status

 

Solution for a three-node cluster

1.     Log in to each master node. Then, execute the systemctl status etcd command to view the ETCD service status. If the ETCD status is not running, go to the next step.

[root@master2 ~]# systemctl status etcd

2.     Stop the Matrix service on each master node.

¡     For a root account, execute the systemctl stop matrix command.

[root@master2 ~]# systemctl stop matrix

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl stop matrix" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop matrix"

3.     Execute the mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix command to stop the kube-apiserver service on each master node.

[root@master2 ~]# mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix

4.     Stop the ETCD service and remove the ETCD data directory on each master node.

¡     For a root account, execute the systemctl stop etcd command to completely stop the ETCD service and then execute the rm -rf /var/lib/etcd/default.etcd/ command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.

[root@master2 ~]# systemctl stop etcd

[root@master2 ~]# rm -rf /var/lib/etcd/default.etcd/

[root@master2 ~]# ll /var/lib/etcd/

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl stop etcd" command to completely stop the ETCD service and then execute the sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/" command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop etcd"

[admin@node4 ~]$ sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/"

[admin@node4 ~]$ ll /var/lib/etcd/

5.     Access the ETCD backup directory /opt/matrix/backup/etcd_backup_snapshot/ to find the latest backup data file, such as Etcd_Snapshot_V900R001B06D012_20210805091653.db. Make sure all nodes have the same backup data file. If a node lacks the file, you can copy the ETCD backup file from another node to it.

[root@master1 ~]# ll /opt/matrix/backup/etcd_backup_snapshot/

6.     Execute the ETCD service restoration script with a new backup file from the specified ETCD backup directory /opt/matrix/backup/etcd_backup_snapshot/. If the backup file does not exist in the backup directory, an error will be reported.

For data consistency, make sure all nodes use the same backup file. If a node does not have the file, copy one from another node.

7.     Access the directory where the ETCD service restoration script resides.

[root@master1 ~]# cd /opt/matrix/k8s/disaster-recovery/

¡     For a root account, execute the following commands:

[root@master2 ~]# bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805091653.db

2021-08-06 06:33:14.788657 I | mvcc: restore compact to 273930

2021-08-06 06:33:14.802137 I | etcdserver/membership: added member 312131d4535cc53f [http://10.99.212.124:2380] to cluster cd6d5adc1bfd16f5

2021-08-06 06:33:14.802189 I | etcdserver/membership: added member 5fc2f82d74297956 [http://10.99.212.123:2380] to cluster cd6d5adc1bfd16f5

2021-08-06 06:33:14.802206 I | etcdserver/membership: added member ad12c65048f444bd [http://10.99.212.120:2380] to cluster cd6d5adc1bfd16f5

¡     For a non-root account, execute the following commands:

[admin@node4 ~]$ sudo bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805014548.db

2021-08-06 01:22:10.876952 I | mvcc: restore compact to 12660679

2021-08-06 01:22:10.906116 I | etcdserver/membership: added member ac2cefc4cae84e25 [http://[2000::100:2000]:2380] to cluster ced7b5d5ee633b40

2021-08-06 01:22:10.906174 I | etcdserver/membership: added member b4689a44b8c1f191 [http://[2000::100:2001]:2380] to cluster ced7b5d5ee633b40

2021-08-06 01:22:10.906197 I | etcdserver/membership: added member c328a554c1ca84f4 [http://[2000::100:2002]:2380] to cluster ced7b5d5ee633b40

8.     Restart the ETCD service on each master node.

¡     For a root account, execute the systemctl restart etcd command.

[root@master2 ~]# systemctl restart etcd

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl restart etcd" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart etcd"

9.     Restart the Matrix service on each master node.

¡     For a root account, execute the systemctl restart matrix command.

[root@master2 ~]# systemctl restart matrix

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl restart matrix" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart matrix"

10.     Restore the kube-apiserver service on each master node.

[root@master2 ~]# mv /opt/matrix/kube-apiserver.yaml /etc/kubernetes/manifests/

11.     Verify that the issue has been resolved.

a.     Use the VIP to log in to the GUI of the Matrix platform.

b.     On the top navigation bar, click Deploy, and then select Cluster from the left navigation pane to view the master node status. The master nodes are in normal status, as shown in Figure 10.

Figure 10 Three master nodes in normal status

 

12.     Click the OBSERVE tab, and then select Workload from the menu to view the running status of the Pods. All Pods are in running status, as shown in Figure 11.

Figure 11 All pods on the Pod tab in running status

 

13.     If the issue persists, contact the support.

ETCD client request times out or primary/standby switchover occurs frequently in the ETCD cluster when ETCD does not have an exclusive use of a disk

Symptom

·     An ETCD client, such as a K8s or Matrix client, takes more than 800 ms to access the ETCD database. The following information is displayed in the etcd.log file in the /var/log/matrix-diag/Matrix/etcd directory on each master node.

2020-11-15 12:36:42.013987 W | etcdserver: read-only range request "key:\"/registry/services/specs/default/kubernetes\" " with result "range_response_count:1 size:295" took too long (877.352309ms) to execute

2020-11-15 12:36:54.026221 W | etcdserver: read-only range request "key:\"/registry/pods/base-service/\" range_end:\"/registry/pods/base-service0\" " with result "range_response_count:42 size:107232" took too long (1.767232614s) to execute

·     Primary/standby switchover occurs frequently in the ETCD cluster, which might be caused by a heartbeat timeout.

Solution

This issue typically results from disk I/O performance degradation in a shared storage environment.
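Before you apply the steps below, you can gauge how often slow requests occur by searching the ETCD log on each master node for the "took too long" warnings shown in the symptom. This is a read-only check.

[root@master1 ~]# grep "took too long" /var/log/matrix-diag/Matrix/etcd/etcd.log | tail -n 5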

To resolve the issue:

1.     If the symptom occurs during execution of an application installation, upgrade, or configuration deployment task and results in task execution failure, perform the task again to fix the issue.

The system does not remove data synchronized during the execution of an installation, upgrade, or configuration deployment task. Task re-execution is likely to succeed because the amount of data to be synchronized decreases, causing less impact on disk I/O performance.

2.     If the symptom occurs during normal operation of the system, edit the matrixLeaderLeaseDuration (lease aging time) and matrixLeaderRetryPeriod (lease detection interval) parameters to delay the timeout for primary/backup switchovers.

However, this change will increase the failover time.

3.     If writing failure or data loss occurs because of poor disk I/O performance, use one of the following methods to fix the issue:

¡     Method 1: If a Pod has a status or communication anomaly, use the kubectl delete pod -n namespace podName command to remove the Pod. The system will create a Pod automatically to restore the ETCD data source.

¡     Method 2: Use the procedure in "Solution for a standalone node" or "Solution for a three-node cluster."

¡     Method 3:

-     Uninstall Matrix from all nodes.

-     Reinstall Matrix on all the nodes.

-     Log in to Matrix to restore the cluster and applications by using the backup files. For more information, see backup and restoration in Unified Platform Deployment Guide.

 


Troubleshooting docker service failure

No response after executing docker commands

Symptom

No response is returned for a long time after you execute the docker ps, docker images, docker inspect, or docker rmi command.

Solution

1.     Restart the docker service.

¡     For the root user, execute the following command to restart the docker service:

[root@master1 ~]# systemctl restart docker

¡     For a non-root user, execute the following command to restart the docker service:

[admin@master1 ~]$ sudo /bin/bash -c "systemctl restart docker"

2.     Verify that the docker service becomes normal.

¡     For the root user, execute the docker images command to view the docker service.

¡     For a non-root user, execute the sudo /bin/bash -c "docker images" command to view the docker service.

The docker service becomes normal if the command output displays the image information of the current node.

Figure 12 Output from the docker images command

 

The page displays that the Docker component is abnormal

Symptom

Power cycling damages the system files in the /var/lib/docker directory and causes the Docker service status to become abnormal.

Solution

To resolve this issue, reinstall the ISO image and rebuild the node.

Figure 13 Rebuilding the node

 


Troubleshooting failures caused by power cycle of the server or network disconnection

Files in the operating system were lost after power cycle of the server

Symptom

The following issues occurred when the server where Matrix is deployed was power cycled during Matrix running process or cluster/application deployment process (for example, cluster deployment, upgrade, restoration, rebuilding, or application deployment or upgrade):

·     Contents in the chronyd.service, docker.service, and containerd.service files in the /usr/lib/systemd/system directory were lost.

·     Contents in the chrony.conf, docker, etcd, hosts, and ssh configuration files in the /etc/ directory were lost. The deployenv.sh file in the /opt/matrix/k8s/ directory was lost.

·     The log files or contents in the log files in the /var/log directory were lost.

Solution

To resolve the issue:

·     The contents in the chronyd.service, docker.service, and containerd.service files were lost:

a.     Execute the ls /usr/lib/systemd/system/service-name.service command to identify whether the service file exists on each node or whether no information exists in the file.

b.     If the service file exists on some nodes and contains correct information, execute the scp command to copy the file to the nodes where the file does not exist or does not contain any information, as shown in the sketch after this list.

c.     If the file does not exist in any node, contact the support or re-install the operating system.

·     The files or contents in the log files in the /etc/ and /var/log directories were lost:

Contact the support or re-install the operating system.

·     The deployenv.sh file in the /opt/matrix/k8s/ directory was lost.

In cluster mode, copy the deployenv.sh file from another Master node that has that file. If no master nodes have that file, rebuild the node. In standalone mode, contact the support or reinstall Matrix.
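The following sketch illustrates the ls and scp checks described above, using docker.service as an example. The healthy node IP address is a placeholder, and the same pattern applies to the other service files and to the deployenv.sh file.

# On each node, check whether the file exists and is not empty
[root@node1 ~]# ls -l /usr/lib/systemd/system/docker.service
# Copy the file from a node that still has a correct copy (replace <healthy-node-ip>)
[root@node1 ~]# scp root@<healthy-node-ip>:/usr/lib/systemd/system/docker.service /usr/lib/systemd/system/
# Reload systemd after restoring service files
[root@node1 ~]# systemctl daemon-reload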

Files that Matrix depends on were lost after power cycle of the server

Symptom

The following issues occurred when the server where Matrix is deployed was power cycled during Matrix running process or cluster/application deployment process (for example, cluster deployment, upgrade, restoration, rebuilding, or application deployment or upgrade):

·     The service file or contents in the service file for the etcd or matrix service were lost:

·     The configuration files, for example, navigator_config.json or contents in the files in the /opt/matrix/ directory were lost.

·     The script files, for example, docker.sh, or contents in the files in the /opt/matrix/ directory were lost.

·     The Docker image files in the /var/lib/docker directory were damaged.

Symptom 1: Some Pods were in ImagePullBackOff state. The event log from the kubectl describe pod command is as follows:

error creating overlay mount to /var/lib/docker/overlay2/698028ac124c9d0ef831f7d2d9506acd01faddaae6ea06a0a169fb352e0eddf4/merged: too many levels of symbolic links

Symptom 2: The following message was displayed:

time="2021-05-10T18:05:50.518918884+08:00" level=error msg="Handler for GET /containers/2494c1172314e37bd8250be06a24e0636b7427f89b3b5a5398ecfad7c2fe171d/json returned error: readlink /var/lib/docker/overlay2/l: invalid argument"

·     The YAML file or contents in the YAML file in the /opt/matrix/ directory were lost.

Solution

To resolve the issue:

·     The service file or contents in the service file were lost, or the files or contents in the files in the /opt/matrix/ directory were lost:

a.     Execute the ls command to identify whether the service file exists on each node or whether no information exists in the file.

b.     If the service file exists on some nodes and contains correct information, execute the scp command to copy the file to the nodes where the file does not exist or does not contain any information.

c.     If the file does not exist in any node, contact the support or re-install Matrix.

·     The Docker image files in the /var/lib/docker directory were damaged:

a.     Upload the Matrix package to rebuild the node.

b.     Contact the support.

The node is in red or yellow or some Pods are in CreateContainerError state on the GUI of the Matrix platform after power cycle of the server

Symptom

The following issues occurred when the server where Matrix is deployed was power cycled during the Matrix running process or cluster/application deployment process (for example, cluster deployment, upgrade, restoration, rebuilding, or application deployment or upgrade):

·     The node is in red or yellow on the Deploy > Clusters page, indicating that Matrix-related Pods are abnormal.

·     Some Pods are in CreateContainerError state on the OBSERVE > Workload page, indicating that product-related Pods are abnormal.

To view all Pods in CreateContainerError state, access the CLI of any master node and execute the kubectl get pod -A -owide | grep CreateContainerError command.

[root@node1 home]# kubectl get pod -A -owide | grep CreateContainerError

NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE   IP NODE       NOMINATED NODE   READINESS GATES

kube-system   calico-kube-controllers-cd96b6c89-hfz7s   0/1     CreateContainerError   0  29d 10.99.212.164    node1   <none>           <none>

Solution

To resolve the issue, use either of the following methods:

·     Method 1:

a.     Access the node where the abnormal Pod resides and execute the docker ps | grep podname | grep -v POD | grep Up | awk '{print $1}' command to obtain the ID of the container in up state. The podname argument represents the name of the abnormal Pod.

[root@node1 home]# docker ps |grep calico-kube-controllers-cd96b6c89-hfz7s | grep -v POD|grep Up|awk '{print $1}'

c755b7812380

b.     Execute the docker stop containerid && docker rm containerid command, for example, docker stop c755b7812380 && docker rm c755b7812380, to delete the container in up state.

c.     Execute the kubectl get pod -A -owide | grep CreateContainerError command to check for Pods in CreateContainerError state. If such Pods still exist, log in to the GUI of the Matrix platform to rebuild the node.

·     Method 2: Log in to the GUI of the Matrix platform and rebuild the node where the abnormal Pod resides.

The node is in red or yellow or some Pods are in Error state on the GUI of the Matrix platform after power cycle of the server

Symptom

The following issues occurred when the server where Matrix is deployed was power cycled during Matrix running process or cluster/application deployment process (for example, cluster deployment, upgrade, restoration, rebuilding, or application deployment or upgrade):

·     The node is in red or yellow on the Deploy > Clusters page, indicating that Matrix-related Pods are abnormal.

·     Some Pods are in Error state on the OBSERVE > Workload page, indicating that product-related Pods are abnormal.

To view all Pods in Error state, access the CLI of any master node and execute the kubectl get pod -A -owide | grep -w Error command.

You can also access the CLI of the node where the Pod in Error state resides and execute the cat /var/log/matrix-diag/Matrix/kubelet/kubelet.log | grep "unexpected end of JSON input" command to view the kubelet logs about the node. If the following error message is displayed, Pod data is corrupted because of power cycle of the node and the Pod cannot start up.

Multus: failed to load netconf: unexpected end of JSON input

Solution

To resolve the issue, use either of the following methods:

·     Method 1 (applicable to scenarios where a small number of Pods are in Error state):

a.     Access the CLI of any master node and execute the kubectl get pod -A -owide | grep -w Error command to view the namespaces and names of Pods in Error state.

b.     Execute the kubectl delete pod -n namespace podName command to delete a Pod in Error state.

To delete multiple Pods in Error state, execute the command multiple times.

·     Method 2 (applicable to scenarios where a large number of Pods are in Error state):

Access the CLI of any master node and execute the kubectl get pod -A -owide | grep -w Error| awk '{print $1 " " $2}'| xargs kubectl delete pod –n command. This command deletes all Pods in Error state.

After you power off and restart the server where the node resides, the Prometheus data file becomes corrupted, causing abnormal Pod status

Symptom

Execute the kubectl get pod -n monitor -owide | grep prometheus command to check the Prometheus Pod name and status. You will find Pods in CrashLoopBackOff state. Use the kubectl logs -f  -n monitor prometheus-podname prometheus-server command to check the logs. The message that errorerr="opening storage failed: /data/xxx is displayed.

Solution

1.     Use the rm -rf /var/lib/ssdata/imonitor/prometheus_data/ command to delete the Prometheus data files on the node where the abnormal Pod resides.

2.     Copy the prometheus_data file from the node with a normal Pod to the node with an abnormal Pod. If all Pods are abnormal, delete the prometheus_data file on every node.

3.     Restart the abnormal Pod.

Power cycle of the node server results in unstable IPv6 NIC for the MACVLAN additional network

Symptom

A Matrix cluster has been deployed. After the server of the node hosting the application is power cycled, the NIC becomes unavailable. The specific symptoms are as follows:

Execute the kubectl exec -it -n kube-system harbor-master1-6mvlb /bin/bash command to enter the container. Then execute the ip a command to check all NIC IPs within the container. If the state of the IPv6 NIC (for example, eth2@if3) in the MACVLAN additional network is tentative dadfailed, this IPv6 NIC is unavailable.

[root@vdhcpsrc1-6658fb96f4-j4n4f /]# ip a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000

    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

    inet 127.0.0.1/8 scope host lo

       valid_lft forever preferred_lft forever

    inet6 ::1/128 scope host

       valid_lft forever preferred_lft forever

3: eth0@if914: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP

    link/ether 6e:e7:ed:2c:ed:5e brd ff:ff:ff:ff:ff:ff link-netnsid 0

    inet 177.177.204.216/32 scope global eth0

       valid_lft forever preferred_lft forever

    inet6 fd00:177:177:0:d416:1f2a:c3a4:ccac/128 scope global

       valid_lft forever preferred_lft forever

   inet6 fe80::6ce7:edff:fe2c:ed5e/64 scope link

       valid_lft forever preferred_lft forever

4: eth1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP

    link/ether d6:ae:4e:73:38:8d brd ff:ff:ff:ff:ff:ff link-netnsid 0

    inet 110.1.0.105/24 scope global eth1

       valid_lft forever preferred_lft forever

    inet6 fe80::d4ae:4eff:fe73:388d/64 scope link

       valid_lft forever preferred_lft forever

5: eth2@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP

    link/ether a2:23:c0:8f:ac:46 brd ff:ff:ff:ff:ff:ff link-netnsid 0

    inet6 130::105/64 scope global tentative dadfailed

       valid_lft forever preferred_lft forever

    inet6 fe80::a023:c0ff:fe8f:ac46/64 scope link

       valid_lft forever preferred_lft forever

Solution

After the server hosting the application is power cycled, the operating system might fail to reclaim the application processes that use the MACVLAN additional network and IPv6 subnets, resulting in residual application processes. In this case, restart the server hosting the container again for recovery.

Power cycle of the node server results in repeated restarts of the Pod that uses the additional network

Symptom

During normal Matrix operation or cluster/application deployment (such as cluster deployment, upgrade, restoration, and rebuild, and application deployment and upgrade), if the server hosting a node restarts, the following symptom occurs:

If you log in to the backend of any master node, and use the kubectl get pod -A -o wide | grep -v Running command to view the Pod status, you can see that the Pod that uses the additional network repeatedly restarts. Use the kubectl describe pod -n namespace podName command to view Pod events. The namespace is the namespace of the abnormal Pod, and podName is the name of the abnormal Pod. The error message is as follows:

Err adding pod to network"net-dtn-sim203": Multus: error in invoke Delegate add -"macvlan": failed to allocate for range 0: requested IP address172.30.128.140 is not available in range set3.2.27.1-3.2.27.254,172.30.128.1-172.30.128.254

Solution

The content loss of the additional network configuration file is due to a power cycle. To resolve this issue, clean the configuration file and follow these steps to perform restoration by using the root user account to log into the backend of the node where the abnormal Pod is located:

1.     Use the cd /var/lib/cni/networks/${macvlan-name}/ command to enter the configuration directory, where macvlan-name is the network name in the error message, such as net-dtn-sim203 in the example.

2.     Use the rm -rf ${IP} command to delete the damaged configuration file, where IP is the IP address in the error message, such as 172.30.128.140 in this example.

 


Troubleshooting deployment failure

Cluster deployment failed and a K8SINSTALL-ERROR occurred

Symptom

Cluster deployment failed. The node logs page accessed by clicking the setting icon  for the node displays K8SINSTALL-ERROR.

Solution

This issue might occur if the node has multiple NICs in up state.

If a NIC is not configured with an IP address:

By default, the arp_ignore parameter is set to 0 in an operating system and the system responds to all ARP requests for the local IP address, including requests for the loopback adapter's address. The matrix node might encapsulate the MAC address of the NIC not configured with an IP address in ARP responses, causing connection errors between cluster nodes.

To resolve the issue:

1.     Use the ifconfig command to view the NIC order when deploying, upgrading, or rebuilding a cluster in a scenario where multiple NICs exist.

2.     Make sure all physical NICs before the NIC whose IP is used as the node IP in the Matrix cluster have IPs assigned or the ONBOOT value set to no. For example, if the node uses the IP of NIC ens191 as the node IP in the Matrix cluster, and ens190 is before ens191 in order, make sure ens190 has an IP assigned.

3.     Make sure no abnormal NIC configuration file exists in the cluster. For example, the ONBOOT value is yes when the NIC is not connected or not configured with an IP.

4.     If the cluster uses a bonding interface as the node IP in the Matrix cluster, make sure all physical NICs not the members of the bonding interface have IPs assigned or the ONBOOT value set to no.

5.     Reboot the network service.

Unified Platform deployment failed because the kubectl exec command failed

Symptom

The system failed to deploy Unified Platform. The log shows that gfs volume creation failed because one node failed to execute the kubectl exec command (the command output reports an error). If you execute the kubectl exec -it pod bash command on that node, all Pods on the node are inaccessible.

Solution

To resolve the issue:

1.     Log in to the CLI of the node that failed to execute the kubectl exec command.

2.     Execute the systemctl restart kubelet.service command to restart the kubelet service on the node.

3.     If the issue persists, contact the support.

Cluster deployment failed in an IPv6 environment when a vNIC was added to a node or an IP address was added to a vNIC

Symptom

The rebuild or upgrade operation fails for a cluster if you add a vNIC to a node or add a new IP address to the vNIC on that node, because the IP address of other nodes in this cluster and this new IP address are not on the same subnet. The system prompts ping failure if you execute the ping6 pod_ip command at the CLI of that node. The pod_ip argument represents the IP address of the container. You can use the kubectl get pod -n kube-system -o wide command to obtain that IP address.

Solution

To resolve the issue:

1.     Change the new IP address to an IP address that is on the same subnet as other nodes.

2.     Configure a routing policy on other nodes so that these nodes can communicate with each other.

3.     If the issue persists, contact the support

 

 


Troubleshooting Unified Platform access failure

Slow request processing caused by ETCD I/O latency

Symptom

Failed to access Unified Platform.

When you review the ETCD log (/var/log/matrix-diag/Matrix/etcd/etcd.log), the following prompt is displayed:

context deadline exceeded, waiting for ReadIndex response took too long, retrying,

When you review the apiserver log and the following prompt is displayed:

stopped listening on [::]:6443

Solution

This symptom might occur when the API server failed to obtain data from ETCD multiple times because of ETCD latency, and then stopped listening to port 6443. In this case, the service components failed to use the K8s API through port 6443.

To resolve the issue:

1.     Verify that the disk I/O performance can meet the requirements. If the average I/O performance value is equal to or greater than 10000, the disk I/O performance meets the requirements. If the average I/O performance value is smaller than 10000, an error has occurred on the disk I/O performance, and you must improve the disk I/O performance.

To test the disk I/O performance:

¡     For the root user, execute the bash /opt/matrix/tools/env_check.sh -p -d /var/lib/etcd command.

¡     For a non-root user, execute the sudo bash /opt/matrix/tools/env_check.sh -p -d /var/lib/etcd command.

2.     Execute the kubectl get pod -n service-software | grep stolon-keeper command to obtain the names of all stolon-keeper Pods.

3.     Execute the kubectl delete pod -n service-software pod_name command to restart the stolon-keeper Pods one by one.

4.     Access Unified Platform after all stolon-keeper Pods recover to the running state.

Unified Platform login failure due to password loss

Symptom

If the user password is lost, the user cannot log in to Unified Platform.

Solution

The cause of this issue is loss of the user password.

Follow these steps to reset the password:

1.     Log in to the /opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset/ directory at the backend of the primary node.

[root@node1 ~]# cd /opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset/

2.     Run the reset_admin_password.sh script to reset the password.

[root@node1 reset]# ./reset_admin_password.sh

+++ readlink -f ./reset_admin_password.sh

++ dirname /opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset/reset_admin_password.sh

+ curDir=/opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset

+ echo /opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset

/opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset

+ cd /opt/matrix/app/install/metadata/UCENTER/portal/portal/portal/scripts/reset

++ kubectl get pod -nservice-software

++ grep stolon-proxy

++ grep Running

++ awk 'NR==1{print $1}'

+ proxy=stolon-proxy-2tp4v

+ kubectl cp ./reset_admin_password.sql -nservice-software stolon-proxy-2tp4v:/tmp

++ base64 -d

++ kubectl get secret -nservice-software stolon '-ojsonpath={.data.password}'

+ pg_admin_password=Or10TL+hRs.=N@0l54

+ kubectl exec -it -nservice-software stolon-proxy-2tp4v -- bash -c 'export PGPASSWORD=Or10TL+hRs.=N@0l54;psql -U kong -d central_db -h 127.0.0.1 -p 5432 -f /tmp/reset_admin_password.sql'

UPDATE 1

+ exit 0

3.     Reset the password to Pwd@12345.

Login anomalies caused by Ext4 file system damage

Symptom

You cannot log in to Matrix or Unified Platform. When you execute any command in the backend, the system returns error message "Read-only file system".

Solution

After a node restarts, if it gets stuck during the system boot process and displays messages like "You are in emergency mode" or "Entering emergency mode", this indicates that the node has entered emergency mode of the operating system. To resolve the issue in this mode:

1.     Execute the dmesg | less or cat /var/log/messages command to view the system logs and identify the partition of the damaged file system. Use partition /dev/sdX as an example of a faulty partition.

2.     Execute the mount|grep /dev/sdX command to view the partition type. If the partition type is ext4, proceed with the following steps.

3.     Execute the umount /dev/sdX command to unmount the file system of that partition.

4.     Execute the e2fsck -y /dev/sdX command to repair the partition. If the file system is severely damaged, data loss might occur after the repair.

 


Troubleshooting GlusterFS anomalies

GlusterFS deployment fails because the disk or disk partition it uses has residual data

Symptom

Use a terminal tool, such as MobaXterm, to connect to the backend at the northbound IP and view the Matrix logs in the /var/log/matrix-diag/Matrix/Matrix/matrix.log file. The Matrix platform displays a message that the disk or disk partition GlusterFS uses has residual data.

Device include vg , nodename:node1, device:/dev/vda3

Cause analysis

Heketi for GlusterFS requires an empty disk or disk partition for deployment, but the disk or disk partition for GlusterFS deployment contained data. You must clean up the disk manually.

Solution

To resolve the issue:

1.     Access the directory where the disk cleanup script resides.

[root@m2 ~]# cd /

opt/matrix/app/install/metadata/gluster/gluster/scripts/tools/

2.     Execute the bash clearDisk.sh disks command to clean up disks, where disks represents the disks to clean up. You must enclose the list of disks in double quotation marks. If you are to clean up multiple disks or disk partitions, use spaces to separate them, for example, bash clearDisk.sh "/dev/vdb /dev/vdc".

[root@m2 ~]# bash clearDisk.sh "/dev/vdb /dev/vdc"

[clear_disk] CAUTION: Please confirm whether to erase the disk /dev/vdb /dev/vdc

Continue anyway? (Y/N) : y

[clear_disk] CAUTION: Please confirm whether to clear glusterfs config file

Continue anyway? (Y/N) : y

[clear_disk] Disk erase complete.

CAUTION

CAUTION:

Identify the disk or disk partition before cleaning up it.

 

3.     Repeat the preceding procedure on all the other master modes and then redeploy GlusterFS.

4.     If the issue persists, contact the support.

Failed to install components that use GlusterFS storage because storage volumes cannot be deleted

Symptom

The installation of a component that uses GlusterFS storage failed. The Matrix logs show that the system failed to delete storage volumes when the installation script uses the volume.sh script. When you execute commands to delete storage volumes on the server where the active master node resides, an error still occurs and the storage volumes fail to be deleted.

Cause analysis

For installation of a component that uses GlusterFS storage, the installation script will use GlusterFS heketi commands to delete and re-create the GlusterFS storage volume. Due to open source issues, during storage volume deletion, an error message appears indicating that the storage volume is mounted to the operating system. However, the mounting information of the operating system does not display any mounting information about the storage volume that causes the deletion failure of the storage volume.

Solution

To resolve the issue:

1.     Access the CLI of each master node and restart the server where the nodes reside. As a best practice, restart the standby master nodes and then the active master node.

2.     Deploy the components that use GlusterFS storage after the cluster recovers to normal.

Failed to upgrade components that use GlusterFS storage because glusterd exits

Symptom

1.     When the upgrade of a component that uses GlusterFS storage failed, access the CLI of the node and review the Matrix logs. The logs show that the upgrade of the component that uses GlusterFS storage has failed multiple times.

2.     Access the CLI of a master node, and execute the kubectl get po -A -owide | grep   glusterfs command to obtain the names of all the GlusterFS Pods in running state.

[root@matrix ~]# kubectl get po -A -owide | grep   glusterfs

glusterfs-example   glusterfs-l6fcr                1/1     Running   0   3d23h   10.99.212.200 matrix   <none>           <none>

glusterfs-example   heketi-848f8f7dd6-nc2kq                       1/1     Running   0 3d23h   177.177.95.77    matrix   <none>           <none>

glusterfs-example   monitor-84964d7cd7-2wjrr                      1/1     Running   0 3d23h   177.177.95.78    matrix   <none>           <none>

3.     Execute the kubectl exec -it -n glusterfs-example   glusterfs-l6fcr /bin/bash command to access GlusterFS Pods.

4.     Execute the ps -aux | grep /usr/sbin/glusterd | grep -v grep command, but the glusterd process is not displayed in the command output.

Solution

This symptom might occur if the glusterd process in GlusterFS Pods abnormally exits during upgrade of components that use GlusterFS storage, which causes the failure of executing storage-related scripts.

To resolve the issue:

1.     Execute the kubectl get po -A -owide | grep   glusterfs command to obtain the names of all GlusterFS Pods in running state, for example, glusterfs-l6fcr.

2.     Execute the kubectl exec -it -n glusterfs-example   glusterfs-l6fcr /bin/bash command to access containers in the GlusterFS Pod.

3.     Execute the systemctl restart glusterd command to restart the glusterd process.

4.     Execute the ps -aux | grep /usr/sbin/glusterd | grep -v grep command to identify whether the glusterd process has been restarted.

5.     After the glusterd process restarts, upgrade the components that use GlusterFS storage again.

GlusterFS service anomaly occurred after Matrix rebuilding by using the ISO image

Symptom

After a node is rebuilt by reinstallation of the ISO image, no GFS-related data can be found in the lsblk command output on the rebuilt node, and operations such as volume creation fail.

Analysis

The possible causes for this issue include:

·     GlusterFS Heketi requires a blank disk or a blank disk partition, but other data exists in the disk or disk partition used by GlusterFS. As a result, GlusterFS data cannot be synchronized to the rebuilt node.

·     The partition number of GFS is inconsistent with that on the node before rebuilding.

·     The glusterd service in the Glusterfs pod of the rebuilt node is still in an abnormal state when GlusterFS initiates data synchronization.

Solution

1.     If other data exists in the disk or disk partition used by GlusterFS, perform the following steps:

a.     Enter the directory where the disk cleaning script is stored.

[root@m2 ~]# cd /opt/matrix/app/install/metadata/gluster/gluster/scripts/tools/

b.     Execute the bash clearDisk.sh disks command to execute the script, where disks represent disks that need to be cleared. You must use double quotation marks to enclose the disks or disk partitions and separate them by spaces, for example, bash clearDisk.sh "/dev/vdb /dev/vdc".

[root@m2 ~]# bash clearDisk.sh "/dev/vdb /dev/vdc"

[clear_disk] CAUTION: Please confirm whether to erase the disk /dev/vdb /dev/vdc

Continue anyway? (Y/N) : y

[clear_disk] CAUTION: Please confirm whether to clear glusterfs config file

Continue anyway? (Y/N) : y

[clear_disk] Disk erase complete.

CAUTION

CAUTION:

Clearing a disk is risky. Identify the disk or partition you are to delete carefully.

 

2.     Verify that partition information for GFS after rebuilding is consistent with that before rebuilding.

[root@c1 ~]# cat /opt/matrix/app/install/metadata/gluster/gluster/heketi/config/cluster.json

{

  "node" : [ {

    "nodename" : "c1",

    "device" : [ "/dev/vdc" ]

  }, {

    "nodename" : "c2",

    "device" : [ "/dev/vdc" ]

  }, {

    "nodename" : "c3",

    "device" : [ "/dev/vdc" ]

  } ]

3.     Reboot the glusterfs pod of the rebuilt node, for example, delete the glusterfs pod with hostname c3.

[root@c3 ~]# kubectl get pod -A -owide |grep glusterfs |grep c3

glusterfs-example   gfs-exporter-php62                         1/1     Running   0          46m    10.99.212.72     c3     <none>           <none>

glusterfs-example   glusterfs-fh2cc                            1/1     Running   0          46m    10.99.212.72     c3     <none>           <none>

glusterfs-example   heketi-75d6c7db69-vhzh2                    1/1     Running   0          26m    177.177.240.5    c3     <none>           <none>

glusterfs-example   monitor-5f9bd8ccb4-54mrn                   1/1     Running   0          26m    177.177.240.4    c3     <none>           <none>

[root@c3 ~]# kubectl delete pod -n glusterfs-example   glusterfs-fh2cc

pod "glusterfs-fh2cc" deleted

For more recovery procedure, see the remote replication methods for GlusterFS storage volume data files at the GlusterFS official website https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-georep/.

Unified Platform upgrade/uninstallation or remote disaster recovery failure caused by deadlocks in GlusterFS

Symptom

Unified Platform upgrade/uninstallation or remote disaster recovery fails.

Solution

Failure cause: A deadlock is a common issue in distributed systems, and might occur in GlusterFS. A deadlock occurs when a set of processes or threads are unable to proceed because each is waiting for a resource that another holds. In practical applications, administrators should focus on designing a robust system architecture and implementing reasonable operation procedures to minimize deadlocks. They should also use the system's deadlock prevention and resolution mechanisms to ensure system reliability and stability.

To resolve the issue:

1.     Enter each GlusterFS pod in sequence and execute the following command to check for output. If any information is output, it indicates a deadlock exists.

cat /var/log/glusterfs/glusterd.log | grep "0-management: Failed to release mgmt_v3 locks"

cat /var/log/glusterfs/glusterd.log | grep "0-management: Failed to release lock for vol"

2.     If you acknowledge a deadlock, execute the systemctl restart glusterd.service command in the pod to restart the glusterd service and release the lock.

After you restart a node or the network changes, the mount directories of some pods that use the GlusterFS storage might have question marks (???) in the file content and the file content cannot be read or written

Symptom

??? files are displayed when the ls –l command is executed in the GFS mounted directory on the host or service container, and the files are not accessible.

Cause analysis

The remaining disk space on the nodes where the three replicas of the GlusterFS storage volume reside is inconsistent, resulting in data inconsistency between the three replicas after data writing. As a consequence, the GlusterFS storage volume data file has a brain split.

Solution

To resolve the issue:

1.     Execute the kubectl get po -A |grep glusterfs command to view the names and namespaces of the GlusterFS Pods.

2.     Execute the kubectl exec command to access the GlusterFS container and then execute the gluster volume heal VolumeName info command to view whether words Is in split-brainis are displayed in the command output and record the file paths. VolumeName is the name of the faulty storage volume. To obtain the storage volume name, execute the kubectl exec –it {gfs pod namespace+gfs pod name} - gluster volume list | grep {service data volume name} command.

3.     Use one of the following methods to resolve the issue:

¡     To resolve the issue based on the file size, execute the gluster volume heal VOLNAME split-brain bigger-file filepath command. VOLNAME is the name of the storage volume. filepath is the full path of the largest file.

¡     To resolve the issue based on the update time of the file, execute the gluster volume heal VOLNAME split-brain latest-mtime filepath command. VOLNAME is the name of the storage volume. filepath is the full path of the most recently generated file.

4.     Use the procedures provided at https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/ to resolve the issue.

5.     If the issue persists, contact the support.

Unified Platform upgrade/uninstallation or remote disaster recovery failure caused by deadlock in GlusterFS

Symptom

Unified Platform upgrade/uninstallation or remote disaster recovery fails.

Solution

Failure cause: A deadlock is a common issue in distributed systems, and might occur in GlusterFS. A deadlock occurs when a set of processes or threads are unable to proceed because each is waiting for a resource that another holds. In practical applications, administrators should focus on designing a robust system architecture and implementing reasonable operation procedures to minimize deadlocks. They should also use the system's deadlock prevention and resolution mechanisms to ensure system reliability and stability.

To resolve the issue:

1.     Enter each GlusterFS pod in sequence and execute the following command to check for output. If any information is output, it indicates a deadlock exists.

cat /var/log/glusterfs/glusterd.log | grep "0-management: Failed to release mgmt_v3 locks"

cat /var/log/glusterfs/glusterd.log | grep "0-management: Failed to release lock for vol"

2.     If you acknowledge a deadlock, execute the systemctl restart glusterd.service command in the pod to restart the glusterd service and release the lock.

In the GlusterFS disaster recovery environment, the primary and backup clusters experience data synchronization issues due to the absence of scheduled tasks

Symptom

After you create a disaster recovery system, the data synchronization status of glusterfs in the remote disaster recovery menu might be displayed as Synchronizing. Log in to the backend of the primary cluster through the northbound IP and execute the cat /etc/crontab command. The command output shows that the scheduled task /opt/matrix/app/install/metadata/gluster/gluster/glusterfs/scripts/rdr/gfs_task.sh is missing.

Solution

The reason for the data synchronization anomaly is that some nodes in the cluster do not have the scheduled task gfs_task.sh for synchronizing data. The specific cause of the missing scheduled task is unclear, but it might be related to the server environment. To resolve this issue:

1.     Log in to each master node of the primary and  backup clusters in sequence and execute the cat /etc/crontab command to search for scheduled tasks. If the output does not contain the /opt/matrix/app/install/metadata/gluster/gluster/glusterfs/scripts/rdr/gfs_task.sh line, proceed with step 2. If the output contains the line, skip this node without taking any action.

2.     Execute the sed -i '$a\*/3 * * * * root /opt/matrix/app/install/metadata/gluster/gluster/glusterfs/scripts/rdr/gfs_task.sh >>/var/log/gfs_task_log 2>&1' /etc/crontab command to add a scheduled task for GlusterFS data synchronization to the node.

 


Troubleshooting IP editing failure

Failed to edit the VIPs

Symptom

Failed to edit VIP settings after in the Advance section on the Deploy > Clusters > Cluster Parameters > Modify Cluster Parameters page. The Matrix logs show the following error messages:

2022-02-16T10:33:52,207 | INFO  | DeployResource-11-thread-1 | K8sClientHelper.getConfigMapByName:2120 | [K8sClientHelper] get configmap by name param: namespace kube-system, configmapName kube-proxy

2022-02-16T10:33:52,227 | ERROR | DeployResource-11-thread-1 | DefaultUncaughtExceptionHandler.uncaughtException:18 | uncaught exception in Thread[DeployResource-11-thread-1,5,main], stack: [java.lang.Thread.getStackTrace(Thread.java:1559), com.h3c.matrix.util.DefaultUncaughtExceptionHandler.uncaughtException(DefaultUncaughtExceptionHandler.java:18), java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1057), java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1052), java.lang.Thread.dispatchUncaughtException(Thread.java:1959)]

java.util.ServiceConfigurationError: io.fabric8.kubernetes.api.KubernetesResourceMappingProvider: Provider io.fabric8.kubernetes.internal.InternalResourceMappingProvider not found

Solution

This symptom might occur if Matrix failed to obtain ConfigMaps due to Fabric8 issues.

To resolve the issue, execute the systemctl restart matrix to restart the current node, and then edit VIP settings again.

 


Troubleshooting image corruption

Image corruption

Symptom

An image corruption error occurs if one of the following symptoms occurs:

·     Symptom 1:

When a Pod is in ImagePullBackOff state and you execute the kubectl describe pod -n namespace podName command to obtain the event log, the command output displays the following:

too many levels of symbolic links

The namespace argument represents the name space of the Pod, and the podName argument represents the name of the Pod.

·     Symptom 2:

When a Pod is in ImageInspectError state and you execute the kubectl describe pod -n namespace podName command to obtain the event log, the command output displays the following:

readlink /var/lib/docker/overlay2/l: invalid argument"

The namespace argument represents the name space of the Pod, and the podName argument represents the name of the Pod.

·     Symptom 3:

Frequently restart of the kubelet service on the node triggers abnormal K8S node status (Not Ready state). The node is marked as red on the page. The node details shows that the kubelet and coreDNS check items are abnormal. From the kubelet log of the faulty node (/var/log/matrix-diag/Matrix/kubelet/kubelet.log), an error log message is frequently output that Image garbage collection failed once. Stats initialization may not have completed yet" err="failed to get imageFs info: unable to find data in memory cache.

Solution

To resolve the issue:

1.     Execute the following commands in sequence to delete all containers and images on the node where the faulty Pod resides:

[root@master1 ~]# systemctl restart docker

[root@master1 ~]# docker system prune

[root@master1 ~]# docker rm -f $(docker ps -aq)

[root@master1 ~]# docker rmi -f $(docker images -q)

2.     If the issue persists, log in to the Matrix platform and rebuild the node where the faulty node resides.

Image layer verification failed

Symptom

The installation or upgrade of an application component failed. Use the kubectl describe pod -n namespace podName command to view the event log, where the namespace argument represents the namespace of the pod and the podName argument represents the name of the pod. In this example,  the namespace is development and the pod name is my-app-pod.

[root@master1 ~]# kubectl describe pod -n development my-app-pod

Filed to pull image matrix-registry.h3c.com:8088/matrix:metrics-server:v0.6.4: rpc error: code =Unknow desc = filesystem layer verification failed for digest sha256:xxxx 

Solution

Access the /opt/matrix/k8s/disaster-recovery/ directory. If the recover-image.sh script exists, execute the bash recover-image.sh imageName:imageTag command. The imageName and imageTagarguments represent the image name and version number in the log query result. In this example, the following command uses matrix and metrics-server:v0.6.4 as the image name and image version number. After the command is successfully executed, deploy or upgrade the component again. If the script does not exist, contact Technical Support for help.

[root@master1 ~]# cd /opt/matrix/k8s/disaster-recovery/

[root@master1 disaster-recovery]# bash recover-image.sh matrix:metrics-server:v0.6.4

 


Troubleshooting PXC failures

Failed to start the PXC database

Symptom

One of the following symptoms might appear:

·     After power cycle of the server, the application services failed to start up. The runtime log prompts that the database connection is abnormal.

·     Failed to log in to the CLI of a master node.

A successful database login is as shown in the following figure:

 

Solution

1.     Execute the following command to delete Pods in the database cluster:

kubectl get pod -n service-software -o wide | grep pxc-node | awk '{print $1}' | xargs kubectl -n service-software delete pod

 

2.     Execute the kubectl logs -f command to obtain the database container startup log.

¡     In cluster mode, if the log shows all pxc node start up at the end, the database cluster is repaired successfully.

¡     In standalone mode, if the log shows mysql state is Synced, the database is repaired successfully.

 

 

Disk file corruption of PXC database

Symptom

In cluster mode, the PXC database failed to operate correctly after the server power cycled. In addition, the service Pods associated with the database failed to start up even when the PXC database restarted.

Solution

1.     Execute the kubectl logs -f to obtain the startup logs about each container for the PXC database and identify which containers failed to prompt Starting MySQL (Percona XtraDB Cluster) database server.

2.     Execute the following command to stop starting the corrupted pxc-node containers:

kubectl delete -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node{1/2/3}.yaml

3.     Clear the permanent directories of the corrupted containers. In this example, Containers pxc-node1, pxc-node2, and pxc-node3 are associated with nodes master1, master2, and master3 of Matrix, respectively. The permanent directories for containers pxc-node1, pxc-node2, and pxc-node3 are /var/lib/ssdata/pxc/pxc/{1}, /var/lib/ssdata/pxc/pxc/{2}, and /var/lib/ssdata/pxc/pxc/{3}, respectively. Execute the rm -rf /var/lib/ssdata/pxc/pxc/{1/2/3}/ command to delete the corresponding permanent directory.

As a best practice, move the files in a permanent directory to another directory and delete them after the repair succeeds.

4.     Execute the following command to restart the pxc-node containers that have stopped:

kubectl apply -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node{1/2/3}.yaml

5.     Execute the kubectl logs -f command to obtain the database container startup log.

In cluster mode, if the log shows all pxc node start up at the end, the database cluster is repaired successfully.

 

 

PXC database in abnormal state

Symptom

·     Symptom 1:

The services associated with the PXC database can connect to the database but cannot use the database. For example, a service might receive a response like WSREP has not yet prepared node for application use.

·     Symptom 2:

The services associated with the PXC database can connect to the database but cannot use the database. The log might receive a response indicating that it timed out waiting for a lock to be unlocked and you can try to send the request again.

·     Symptom 3:

The services associated with the PXC database can connect to the database but cannot use the database. No response is returned from the database.

Cause analysis

·     Cause analysis for symptom 1:

This symptom might occur if a brain split occurs on the database cluster. Typically, the database cluster can recover to normal automatically. You can access the CLI of a master node to identify the state of the database cluster. If the database cluster has recovered to normal, the query result for the wsrep_local_state_comment, wsrep_ready, and wsrep_incoming_addresses fields are as shown in the following figure:

 

If the query result is different, the cluster has not recovered from the brain split. For example:

¡     The value for the wsrep_local_state_comment field might be Initialized or Joining: receiving State Transfer.

¡     The value for the wsrep_ready field might be OFF, which indicates that the current container is not available.

¡     The wsrep_incoming_addresses field does not display IP addresses of all pxc-node containers, which indicates that not all the containers are in the same database cluster.

·     Cause analysis for symptom 2:

This symptom might occur if the database has dead locks. The dead locks might be metadata locks or exclusive locks.

·     Cause analysis for symptom 3:

This symptom might occur if data synchronization across the database cluster failed.

Solution

1.     Execute the following command to delete Pods in the database cluster:

kubectl get pod -n service-software -o wide | grep pxc-node | awk '{print $1}' | xargs kubectl -n service-software delete pod

 

2.     Execute the kubectl logs -f command to obtain the database container startup log.

¡     In cluster mode, if the log shows all pxc node start up at the end, the database cluster is repaired successfully.

¡     In standalone mode, if the log shows mysql state is Synced, the database is repaired successfully.

 

 

Loss of all contents of PXC database startup file grastate.dat

Symptom

In standalone mode, the PXC database fails to start up after the server shuts down unexpectedly and then restarts. As the result, pods dependent on the PXC database fail to start. The issue persists after the PXC database pod is restarted.

·     View the grastate.dat file of the PXC database in the CLI of the node. It is found that the file is empty and the file content is lost.

 

·     The database cannot be connected.

 

·     The pods that use the database service are abnormal.

 

·     The following error messages are displayed for the PXC Pod:

 

Solution

1.     Execute the following command to stop container pxc-node1 that failed to start up.

kubectl delete -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node1.yaml

2.     Execute the vim grastate.dat command, add the following contents to the file, and then save the file.

# GALERA saved state

version: 2.1

uuid:    2013b697-a063-11ed-b00e-d340082886cf

seqno:   -1

safe_to_bootstrap: 1

Figure 14 Viewing the grastate.dat file content

 

3.     Execute the following command to restart the stopped container pxc-node1.

kubectl apply -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node1.yaml

4.     Execute the kubectl logs -f command to view the startup logs of the database container. The database has been repaired if a message "mysql state is Synced message" is printed.

5.     Verify correct connectivity of the database.

 

 

Power outage resulting in loss of all disk file content required to start the PXC database

Symptom

A power outage caused the loss of all disk file content required to start the PXC database (the grastate.dat file was cleared). The abnormal Pod cannot connect to the PXC database.

Solution

To repair the grastate.dat file:

1.     Connect to the backend of the node where the abnormal Pod is located (switch to the root user with sudo -i if you are a non-root user).

2.     Enter the PXC data directory by executing the cd /var/lib/ssdata/pxc/pxc/* command.

3.     Execute the cat grastate.dat command to check whether the file is empty. Typically, the file is as shown in the following figure.

Figure 15 Viewing the file

 

4.     If the file content is empty, use the following command to complete it (the red part is the content of the file):

[root@mmp-001 1]# cat > grastate.dat <<- EOF

# GALERA saved state

version: 2.1

uuid: 938a614c-8c38-11ee-a589-8a75c3c33cea

seqno: -1

safe_to_bootstrap: 0

EOF

In a standalone environment, you only need to make sure the file format is correct without editing values such as UUID. They are automatically generated upon startup. In a cluster environment, you can obtain the file content from another node that is operating correctly.

5.     After repairing the file, execute the following command to bulk delete the database cluster Pods and restart pxc-node.

kubectl get pod -n service-software -o wide | grep pxc-node | awk '{print $1}' | xargs kubectl -n service-software delete pod

Figure 16 Command view

 


Troubleshooting server restart issues

Power outage resulting in ZooKeeper data sync failure between three nodes

Symptom

When a power outage occurs on the three nodes, the ZooKeeper data cannot be synced between them. After power is restored and services are restarted, you cannot guarantee that ZooKeeper on all three nodes will start simultaneously, resulting in sync failure of ZooKeeper persistent data.

Solution

To resolve the issue, rebuild ZooKeeper by deleting and restarting the three ZooKeeper Pods, as shown in the following figure.

Figure 17 Rebuilding ZooKeeper

 

Server power outage resulting in damage of the XFS file and Vertica database

Symptom

After a server power outage, SSH login to the backend fails. Restarting the sshd service triggers an emergency mode, indicating that the XFS file system is damaged due to the exceptional power outage. Verticaluanch and verticaluanch-10s frequently restart, indicating an issue with the Vertica database.

Solution

To repair the XFS file system:

1.     Use the lsblk command to locate the mount path and the umount command to unmount it, ensuring the partition is in an unmounted state.

Figure 18 Viewing the mount path

 

2.     Execute the xfs_repair -n command to check whether the file system is damaged.

3.     Execute the xfs_repair  -L /dev/sda1 command to repair the file system.

4.     Execute the xfs_ncheck /dev/sda1 to check whether the file system has been successfully repaired.

CAUTION

CAUTION:

Executing the xfs_repair command with the -L option to repair the XFS file system will clear the log and might result in the loss of user data and files. Make sure you understand the impact before executing the command in this way.

 

5.     Verify the result

a.     After you repair the disk with damaged files and restart it, SSH login to the node recovers to normal status, and the Pods resume normal operation.

b.     The database exception is associated with the issue described in corrupted ETCD DB files. The issue prevents associated tasks from being started. The issue is cleared after you rebuild the database of the node.

Kafka startup failure due to ZooKeeper data sync issue upon restart

Symptom

A cluster power outage results in a Kafka anomaly, preventing Campus from receiving replies from Kafka during interactions with devices through WebSocket. As a result, all interconnect links between leaf and spine devices turn red in the topology.

Solution

To resolve this issue:

1.     Clear the data in the /var/lib/ssdata/zookeeper and /var/lib/ssdata/kafka directories on all nodes.

2.     Restart the Pods associated with Kafka and ZooKeeper.

Execute the delete command on Kafka and ZooKeeper Pods in sequence.

Figure 19 Restarting Pods

 

Forcible power shutdown resulting in damage to the operating system's XFS partition upon server restart and system anomalies

Symptom

A forcible on-site power shutdown results in damage to the operating system's XFS partition upon server restart. This results in system anomalies, preventing access to the operating system or triggering the emergency mode.

Solution

Enter emergency mode or single-user view to forcibly repair the damaged partition. After that, restart the server. Then you can access the operating system correctly.

To repair the XFS file system:

1.     Use the lsblk command to locate the mount path and the umount command to unmount it, ensuring that the partition is in an umounted state.

Figure 20 Viewing the mount path

 

2.     Execute the xfs_repair -n command to identify whether the file system is damaged.

3.     Repair the file system by executing associated commands such as xfs_repair -L /dev/sda1.

4.     Execute the xfs_ncheck /dev/sda1 to check whether the file system has been successfully repaired.

CAUTION

CAUTION:

Executing the xfs_repair command with the -L option to repair the XFS file system will clear the log and might result in the loss of user data and files. Make sure you understand the impact before executing the command in this way.

 


Troubleshooting Unified Platform upgrade failures

After you upgrade Unified Platform, the k-eureka pod and others are abnormal and in Pending state.

Symptom

After you upgrade Unified Platform, execute the kubectl get pod -A | grep eure command to check the status of pods. In the command output, some pods, for example, the k-eureka pod, are in Pending state.

Figure 21 k-eureka-2 pod in Pending state

 

Execute the kubectl describe pod command. The system prompts an error that the node hosting the k-eureka pod does not meet the pod affinity/anti-affinity rules.

Figure 22 Pod affinity/anti-affinity rules are not met

 

When you check the affinity/anti-affinity rules configured for the node where the pod is located, you find that the node meets the requirements.

The issue occurs due to the residual information of the deleted k-eureka pod in the kube-scheduler cache. When a pod with the same name is started again, scheduling anomalies might occur.

Solution

Restart the kube-scheduler pod on each master node.

Perform the following tasks in the backend of one of the master nodes:

1.     Execute the kubectl get pod -A -owide | grep kube-scheduler command to obtain name of the kube-scheduler pod.

Figure 23 Viewing the kube-scheduler pod name

 

2.     Execute the kubectl delete pod -n kube-system pod-name command , where pod-name parameter is the pod name obtained in step 1. In a cluster environment, you can delete multiple kube-scheduler pods simultaneously. Replace the pod-name parameter with the names of three kube-scheduler pods, separated by spaces.

Figure 24 Deleting multiple kube-scheduler pods in a cluster environment

 

3.     After the kube-scheduler pod restarts, the abnormal pods will restore to normal state after a while. If the issue persists, contact Technical Support for help.

 


Troubleshooting RDRS

If you delete the RDRS on the primary site after the primary and backup sites are disconnected, residual configurations exist on the backup site. 

Symptom

In an RDRS scenario, if the backup site goes down or the primary site experiences network issues, deleting the RDRS on the primary site will leave residual RDRS configurations on the backup site. As a result, the components in the backup site remain in RDRS mode.

Solution

Cause

Because network issues occur between the primary site and the backup site or the backup site is down, the backup site fails to receive and execute requests from the primary site, resulting in residual configuration.

Solution

Execute the script to delete the backup site configuration to resolve this issue. More specifically:

Execute the following command on any node in the backup site. If the script execution log prompts success, it means the repair was successful.

sh  /opt/matrix/app/install/metadata/UCENTER/general/rdr/scripts/clearMemberRdrConfig.sh

 

During the process of primary/backup switchover, a network issue between the primary and backup sites caused the new backup site to become inaccessible.

Symptom

In an RDRS scenario, during the process of primary/backup switchover, network anomalies between the primary and backup sites cause the new backup (originally primary) site to become inaccessible. As a result, the page is displayed abnormally.

Solution

Cause

Due to network issues during the primary/backup switchover, the new backup site cannot establish a connection to the new primary site while switching to the backup role. As a result, a database anomaly occurs.

Solution

1.     If the new primary site is displayed as failing to switch to the primary role on the RDRS page, first make sure the network between the primary and backup sites has recovered. Then, log in to the Unified Platform page of the new primary site, navigate to the System > Emergent Recovery > RDRS page, and click the  icon for an RDRS component. Wait for the switchover to succeed.

2.     If the new primary site is displayed as in primary state on the RDRS page, first make sure the network between the primary and backup sites has recovered. Then, execute the sh  /opt/matrix/app/install/metadata/UCENTER/general/rdr/scripts/forceDropProduct.sh command on any node at the backup site. If the script execution log prompt a message as shown in the following figure, the repair succeeds.

 

The primary site cannot automatically switch to the backup role though the backup site successfully takes over in automatic switchover mode

Symptom

In auto switchover mode with arbitration, the primary site is abnormal, and the backup site automatically takes over. After the original primary site recovers from the anomaly, the current primary site cannot automatically switch to the backup role. As a result, two primary sites exist.

Solution

To resolve the issue:

1.     First, make sure the original primary site has restored to normal. Then, log in to the Unified Platform page of the new primary site, navigate to the System > Emergent Recovery > RDRS page, and change the switchover mode to manual. Click the  icon for an RDRS component. Wait for the switchover to succeed.

2.     If the issue persists, contact H3C Support.

Components in primary state on both the primary and backup sites and service exceptions

Symptom

All components on the primary and backup sites are in primary state. Log in to the Unified Platform page of the primary and backup sites. The controller menus on both sites are displayed normally.

Solution

·     Possible reasons include:

In manual switchover mode, when the network between the primary and backup sites is disrupted, specify the primary role on the backup site page. The backup component takes over successfully. The original primary site does not receive the switchover to backup request due to network issues. As a result, the original primary site fails to switch to the backup role and ultimately two primary sites exist in the system.

·     To resolve the issue:

First, make sure the original primary site has restored to normal. Then, log in to the Unified Platform page of the new primary site, navigate to the System > Emergent Recovery > RDRS page, and change the switchover mode to manual. Click the  icon for an RDRS component. Wait for the switchover to succeed.

Restart of some nodes in primary/backup sites, and PXC data synchronization anomaly

Symptom

After some nodes in the primary and backup sites restart, data synchronization of the PLAT Percona XtraDB Cluster (PXC) component on the RDRS page might become abnormal.

Solution

Cause

In the PXC disaster recovery system, data is synchronized by establishing the primary/backup relationship between pxc-node1 of the primary site and pxc-node1 of the backup site. If some nodes in the primary and backup sites restart, including pxc-node1 of either site, data synchronization will be disrupted.

Solution

After you verify that the restarted node and its pxc-node have restored to normal state, identify whether the PXC data synchronization is abnormal. If yes, log in to the Unified Platform GUI of the primary site, navigate to System > Emergent Recovery > RDRS page, find the PLAT Percona XtraDB Cluster component under SYSTEM, and then click the Synchronize Data icon in the Actions column. Wait for the repair to complete.

 


Troubleshooting Kafka anomalies

Service anomalies when the Kafka instance is in Running state

Symptom

Scenario 1:

If a Kafka Pod continuously outputs a large number of error logs, execute the kubectl logs -nservice-software -f abnormal Kafka Pod name | grep "Error processing append operation on partition __consumer_offsets” command. Alternatively, access the backend node directory of the faulty Kafka Pod at /var/lib/ssdata/log/kafka/broker2 (assuming kafka-2 pod service is faulty), and locate the most recent server.log or kafka-server.log file. Identify whether the file contains a large number of error logs of Error processing append operation on partition __consumer_offsets or java.nio.BufferOverflowException. If the previous symptom persists, the Kafka Pod service is faulty.

Scenario 2:

The Kafka Pod is in normal status, but the associated services are unable to consume messages. For example, the service might report an error message that org.apache.kafka.common.errors.TimeoutException: Failed to update metadata.

Scenario 3:

The Kafka Pod is in normal status, but the part of the associated services cannot consume messages. For example, the service might report an error message that Offset commit failed on partition.

Scenario 4:

The Kafka pod is running correctly but messages of the associated business services cannot be consumed. For example, the service might report an error message that "org.apache.kafka.common.errors.TimeoutException: Failed to update metadata."

Scenario 5:

If the kafka pod status is normal but messages of all associated business services cannot be consumed, access the kafka pod to view group details:

kubectl exec -it -nservice-software   kafka pod name 

-- bash -c "/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka-svc-0:9093,kafka-svc-1:9093,kafka-svc-2:9093  --group G_iMC_5 --describe"

The error message that "The coordinator is loading and hence can’t process requests." is reported.

Scenario 6:

Access kafka to view the group status:

kubectl exec -it -nservice-software   kafka pod name 

-- bash -c "/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka-svc-0:9093,kafka-svc-1:9093,kafka-svc-2:9093  --list "

The system prompts an error message that it cannot respond to requests.

If the preceding issues persist, it indicates that the current kafka pod service is abnormal.

Scenario 7:

All kafka pod instances are normal, but a node in kafka has data corruption, causing the corrupted topic to be unreadable and unwritable. The service side reports an error message like "Caused by: java.lang.IllegalStateException: Topic(s) [SA_AI_Alarm_Aggregate] is/are not present and missingTopicsFatal is true.”

Solution

Analysis

In the previous scenarios, communication issues between Kafka and ZooKeeper can result in loss of temporary nodes registered by some Kafka Server nodes in ZooKeeper or controller switchover within Kafka itself. As a result, metadata might fail to be updated to ZooKeeper or the Kafka cache in time, causing inconsistencies between the metadata in ZooKeeper and on Kafka Server nodes. Consequently, metadata might be corrupted, and all or some Kafka service Pods are unable to provide services.

Troubleshooting procedure

To restart the Kafka Pod to resolve the issue:

For scenarios 1 and 2: Directly execute the kubectl delete pod -nservice-software abnormal kafka-pod name command to resolve the issue.

For scenarios 3, 4, 5, and 6: Delete the three Kafka Pods separately (one Pod for a single-node cluster) to resolve the issue. Execute the following commands:

kubectl delete pod -nservice-software kafka-0-xxxxx

kubectl delete pod -nservice-software kafka-1-xxxxx

kubectl delete pod -nservice-software kafka-2-xxxxx

Where xxxxx represents the suffix of the Kafka instance Pod name, as shown in the following figure.

Figure 25 Executing the command to delete Kafka Pods

 

For scenario 7: You must  delete data from topics that report errors on an abnormal node and then restart the kafka pod on that node as follows:

2.     Access the node that reports errors and execute the cd /var/lib/ssdata/kafka/broker*/kafka-log-data command to access the directory of the node with corrupted files.

3.     Execute the following commands in sequence:

sudo rm -rf *

kubectl delete pod -nservice-software abnormal kafka pod name

Abnormal Kafka instance status upon system power outage or restart

Symptom

After a node system powers off or restarts, the Kafka service might have Pods that are not in Running state. Such Pods continuously restart with an increasing restart count, and the their status switches between CrashLoopBackOff and Running. The following figure shows the Kafka Pods that are running correctly

Figure 26 Kafka Pods that are running correctly

 

Solution

A power outage or system restart on the node might damage the data files of the Kafka instance running on that node. As a result, the associated Pod keeps in a continuous restart status.

You can use the kubectl logs -nservice-software -f abnormal kafka-pod name | grep "Found a corrupted index file" command to check for any corruption in Kafka data files.
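To check all Kafka Pods at one time instead of one by one, the following is a minimal sketch under the same assumptions (service-software namespace, corruption indicated by the "Found a corrupted index file" log line); only the most recent log lines are scanned:

# Scan the recent logs of each Kafka Pod for the corruption marker.
for pod in $(kubectl get pod -nservice-software -o name | grep "pod/kafka-"); do
    echo "== $pod =="
    kubectl logs -nservice-software "$pod" --tail=2000 | grep "Found a corrupted index file"
done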

1.     Execute the kubectl get pod -nservice-software -owide | grep kafka command to locate the node where the faulty Pod resides.

Figure 27 Locating the node where faulty Pod resides

 

2.     Execute the cd /var/lib/ssdata/kafka/broker2/kafka-log-data command to access the directory that contains the corrupted files on that node.

3.     Execute the following commands in sequence:

sudo rm -rf *

kubectl delete pod -nservice-software abnormal kafka-pod name

4.     Wait for the faulty Pod to return to normal status.
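To monitor the recovery without re-running the command manually, you can wrap the status check from step 1 in watch (a sketch that assumes the watch utility is available on the node):

# Refresh the Kafka Pod status every 5 seconds until the faulty Pod is Running again.
watch -n 5 "kubectl get pod -nservice-software -owide | grep kafka"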

 

 

NOTE:

In a three-instance Pod cluster environment for Kafka, file damage on one node does not affect Kafka services, but file damage on more than one node might. In a single-node instance Pod environment, normal Kafka service operation and data integrity cannot be guaranteed if files are damaged. After recovery through the previous steps, Kafka can resume normal operation, but data loss might occur.

 

Unexpected power outage resulting in Kafka data loss

Symptom

An unstable installation environment causes the server to shut down unexpectedly every night, leading to Kafka data loss and Pod anomalies.

Solution

Restart the Kafka middleware.

Figure 28 Restarting the Kafka middleware

 
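If the GUI restart shown in the figure is not available, a CLI alternative consistent with the earlier Kafka scenarios is to delete the Kafka Pods so that Kubernetes recreates them (the xxxxx suffixes represent the actual Pod name suffixes in your environment):

# Delete the three Kafka Pods (one Pod for a single-node cluster); Kubernetes recreates them automatically.
kubectl delete pod -nservice-software kafka-0-xxxxx kafka-1-xxxxx kafka-2-xxxxx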


Troubleshooting Redis

itom-central-redis cluster data synchronization anomaly

Symptom

In a cluster environment, data synchronization between the master node and the slave1 and slave2 nodes in the itom-central-redis cluster might experience significant delays. As a result, data cannot be synchronized in a timely manner. If a master/slave switchover occurs at this moment, temporary data loss might occur.

Solution

Analysis

itom-central-redis uses the Redis master/slave mode, but sometimes a significant delay exists in data synchronization between the master and slave nodes. This issue usually occurs when certain master and slave nodes restart or when the network is abnormal, causing full synchronization to be incomplete or disrupted.

Troubleshooting procedure

The itom-central-redis cluster contains three pods: itom-central-redis-master, itom-central-redis-slave-1, and itom-central-redis-slave-2. As long as the pods are in Running state, the priority order for acting as the master Redis node is: itom-central-redis-master > itom-central-redis-slave-1 > itom-central-redis-slave-2. More specifically, if itom-central-redis-master is available, it acts as the master node. If itom-central-redis-master is unavailable and itom-central-redis-slave-1 is available, itom-central-redis-slave-1 acts as the master node. If both itom-central-redis-master and itom-central-redis-slave-1 are unavailable, itom-central-redis-slave-2 acts as the master node.

Use the kubectl get pod -n service-software | grep itom-central-redis command to view the pod status. In the second column, a value of 1/1 indicates the master node and a value of 0/1 indicates a slave node. During the check, if itom-central-redis-master is the master node, check both itom-central-redis-slave-1 and itom-central-redis-slave-2. If itom-central-redis-slave-1 is the master node, check only itom-central-redis-slave-2.
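The following is an illustrative example of this check. The output is a sketch only; the actual Pod names carry instance suffixes and the ages differ in each environment:

kubectl get pod -n service-software | grep itom-central-redis
# Example output (illustrative):
# itom-central-redis-master-...    1/1   Running   0   3d    <- current master node
# itom-central-redis-slave-1-...   0/1   Running   0   3d    <- slave node
# itom-central-redis-slave-2-...   0/1   Running   0   3d    <- slave node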

1.     Identify whether the synchronization process between master and slave is still in progress. More specifically:

¡     Execute the following command on any k8s master node to identify whether the master_sync_in_progress status is 1. If the status is 1, synchronization is in progress and no additional processing is needed. You can execute the command multiple times until the master_sync_in_progress status changes to 0, which indicates that data synchronization is complete. (If you are using version E072x, replace redis-cli with mkvdb-cli in the command.) An illustrative output excerpt is shown after this list.

kubectl exec -it -n service-software $(kubectl get pod -n service-software |grep itom-central-redis-slave-1|awk '{print $1}') -- redis-cli -p 30011 info replication

¡     To check the synchronization delay between itom-central-redis-slave-2 and the master node, execute the following command and interpret the result in the same way as in the preceding step. (If you are using version E072x, replace redis-cli with mkvdb-cli in the command.)

kubectl exec -it -n service-software $(kubectl get pod -n service-software |grep itom-central-redis-slave-2|awk '{print $1}') -- redis-cli -p 30011 info replication
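The fields checked in this procedure come from the info replication output of the slave Pod. The following excerpt is illustrative only, with made-up values; the field names are standard Redis replication fields:

# Illustrative excerpt of "info replication" output on a slave node (values are made up):
# role:slave
# master_link_status:up
# master_last_io_seconds_ago:1
# master_sync_in_progress:0
# slave_repl_offset:102400
# master_repl_offset:102400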

2.     After the preceding checks are completed, the delay between the master and slave is increasing if all of the following conditions exist:

¡     The master_sync_in_progress status is 0.

¡     The slave_repl_offset is smaller than the master_repl_offset, and the difference is gradually increasing.

¡     The master_last_io_seconds_ago value also increases.

In this case, you must restart the corresponding slave node pod. For example, if you discover a synchronization delay between itom-central-redis-slave-2 and the master node, execute the following command to restart the pod, resynchronize the data, and attempt to resolve this issue.

kubectl delete pod -n service-software $(kubectl get pod -n service-software |grep itom-central-redis-slave-2|awk '{print $1}')
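After the Pod is recreated, you can re-run the check from step 1 to confirm that replication has recovered; watch for master_link_status being up and the slave_repl_offset catching up with master_repl_offset. (As above, replace redis-cli with mkvdb-cli if you are using version E072x.)

kubectl exec -it -n service-software $(kubectl get pod -n service-software |grep itom-central-redis-slave-2|awk '{print $1}') -- redis-cli -p 30011 info replication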

 
