H3C Unified Platform Troubleshooting Guide-E07xx-5W101


H3C Unified Platform

Troubleshooting Guide

Document version: 5W101-20221018

 

Copyright © 2022 New H3C Technologies Co., Ltd. All rights reserved.

No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of New H3C Technologies Co., Ltd.

Except for the trademarks of New H3C Technologies Co., Ltd., any trademarks that may be mentioned in this document are the property of their respective owners.

The information in this document is subject to change without notice.


Contents

Introduction
General guidelines
Collecting log and diagnostic information
Troubleshooting operating systems
H3Linux installation failure
Symptom
Solution
Troubleshooting cluster nodes
Node hardware failure
Symptom
Solution
Application deployment failure because of abnormal node state
Symptom
Solution
Unified Platform unavailable because of cluster node exceptions
Symptom
Solution
Containers get stuck in evicted state because of insufficient disk space
Symptom
Solution
K8s node gets stuck in Not Ready state after a hugepages size change
Symptom
Solution
Cluster network mode change failure
Symptom
Solution
Exception on kube-apiserver, kube-scheduler, or kube-controller-manager after Matrix upgrade
Symptom
Solution
Error occurrence on Pods including calico-node with error message Delegation not available for unit type
Symptom
Solution
Troubleshooting cluster upgrade failure
Matrix node upgrade failed and HARBORINSTALL-ERROR occurred
Symptom
Solution
Troubleshooting network errors for containers
Network errors caused by network service restart
Symptom
Solution
Troubleshooting cluster expansion failures
Cluster deployment failure with successful configuration rollback
Symptom
Solution
Failed configuration rollback after cluster deployment failure
Symptom
Solution
Troubleshooting blocked access to Matrix services
The cluster rejects access to any Matrix service if the security policy drops all access requests
Symptom
Solution
Troubleshooting Matrix login failure because of incorrect password
Login failure occurs because of incorrect password
Symptom
Solution
Troubleshooting default route loss
Default routes get lost after the ifconfig command restarts the NIC
Symptom
Solution
Troubleshooting ETCD service failure
ETCD service startup failure
Symptom
Solution
ETCD client request times out or primary/standby switchover occurs frequently in the ETCD cluster when ETCD does not have exclusive use of a disk
Symptom
Solution
Troubleshooting docker service failure
No response after executing docker commands
Symptom
Solution
Troubleshooting failures caused by power cycle of the server or network disconnection
Files in the operating system were lost after power cycle of the server
Symptom
Solution
Files that Matrix depends on were lost after power cycle of the server
Symptom
Solution
The node is in red or yellow or some Pods are in CreateContainerError state on the GUI of the Matrix platform after power cycle of the server
Symptom
Solution
The node is in red or yellow or some Pods are in Error state on the GUI of the Matrix platform after power cycle of the server
Symptom
Solution
Troubleshooting cluster deployment failure
Cluster deployment failed and a K8SINSTALL-ERROR occurred
Symptom
Solution
Troubleshooting Unified Platform deployment failure
Unified Platform deployment failed because the kubectl exec command failed
Symptom
Solution
Troubleshooting cluster deployment failure in an IPv6 environment
Cluster deployment failed in an IPv6 environment when a vNIC was added to a node or an IP address was added to a vNIC
Symptom
Solution
Troubleshooting Unified Platform access failure
Slow request processing caused by ETCD I/O latency
Symptom
Solution
Troubleshooting GlusterFS Pod anomaly after a node reboot or network change
Files with the garbled ??? filename are displayed in the mounted directory on the host or service container and the files are not accessible
Symptom
Cause analysis
Solution
Troubleshooting GlusterFS deployment failure after Matrix uninstallation
GlusterFS deployment fails because the disk or disk partition it uses has residual data
Symptom
Cause analysis
Solution
Failed to install components that use GlusterFS storage because storage volumes cannot be deleted
Symptom
Cause analysis
Solution
Failed to upgrade components that use GlusterFS storage because glusterd exits
Symptom
Solution
Troubleshooting IP editing failure
Failed to edit the VIPs
Symptom
Solution
Troubleshooting image corruption
Image corruption
Symptom
Solution
Troubleshooting failure of PXC Pods because of server power cycle, network disconnection, or switch from standalone mode to cluster mode
Failed to start the PXC database
Symptom
Solution
Disk file corruption of PXC database
Symptom
Solution
PXC database in abnormal state
Symptom
Cause analysis
Solution


Introduction

This document provides information about troubleshooting common software and hardware issues with Unified Platform.

General guidelines

To help identify the cause of issues, collect system and configuration information, including:

·     Unified Platform version and operating system version.

·     Symptom, time of failure, and configuration.

·     Network topology information, including network diagram, port connections, and points of failure.

·     Log messages and diagnostic information. For more information about collecting this information, see "Collecting log and diagnostic information."

·     Steps you have taken and the effect.

Collecting log and diagnostic information

1.     Enter https://node_ip_address:8443/matrix/ui in the address bar and then press Enter. node_ip_address represents the IP address of the master node.

2.     Enter the password and username and then click Login.

3.     Click the set icon  in the upper right corner of the page, and then select Export logs.

Figure 1 Exporting log messages

 


Troubleshooting operating systems

H3Linux installation failure

Symptom

All disks are selected by default for disk partitioning during operating system installation. In actual use, however, you might need to select only some of the disks. For example, if a USB drive is used to install the operating system, you must unselect that USB drive. When the second disk is unselected or only the first disk is selected, the partitioning scheme that is automatically loaded does not contain the /var/lib/etcd partition.

Solution

This symptom might occur if the second disk is not selected for partitioning during operating system installation. This is because the default partitioning scheme is customized on the premise that all disks are selected. If multiple disks are present, the system partition resides on the first disk and the ETCD partition resides on the second disk. If the second disk is not selected, the /var/lib/etcd partition is missing from the partitioning scheme.

To resolve the issue:

If the /var/lib/etcd partition is missing from the partitioning scheme that is automatically loaded, manually mount the partition as follows:

1.     Click the add icon. In the window that opens, specify the /var/lib/etcd mount point, and set the expected capacity to 50 GiB.

2.     Click Add Mount Point.

3.     Configure the partition settings, and then click OK.

 


Troubleshooting cluster nodes

Node hardware failure

Symptom

A node in the cluster fails and needs to be replaced.

Solution

This symptom might occur if an unrecoverable hardware failure occurs on a node and causes the node to operate incorrectly.

To resolve the issue:

1.     Replace the faulty node with a new server. Make sure the new server uses the same IP address, username, and password as the faulty node.

2.     Install the Matrix platform on the new node. For more information, see Unified Platform Deployment Guide.

3.     Log in to the Matrix platform. Click Deploy on the top navigation bar and then select Cluster from the navigation pane. Then, click the set icon  for the target node, and select Repair to repair the node.

4.     If the issue persists, contact the support.

Application deployment failure because of abnormal node state

Symptom

The primary master node in the cluster does not respond during application deployment. Logging in to the Matrix platform again fails when you use the IP address of that node but succeeds when you use the northbound service virtual IP of the cluster. After re-login, the platform shows that the node is in abnormal state and is no longer the primary master node.

Solution

This symptom might occur when a network error is present or the node experienced an unexpected power failure.

To resolve the issue:

1.     Verify that the failed node and the Matrix platform can reach each other by using the ping command and verify that the node is not powered off.

2.     Log in to the Matrix platform and click the set icon  of the failed node from the Cluster page to repair the node.

3.     Deploy the application again.

4.     If the issue persists, contact the support.

Unified Platform unavailable because of cluster node exceptions

Symptom

The following issues are all present:

·     Failed to log in to Unified Platform.

·     On the Matrix platform, a master node is in red and you cannot ping the abnormal node.

·     Pods in running state reside on the abnormal node.

·     Execute the kubectl get endpoints -n service-software itom-central-login-svc command to identify the endpoints for the itom-central-login service. If the Pod IP addresses on the abnormal node remain among these endpoints, an exception has occurred.

Figure 2 Identifying endpoints for the itom-central-login service
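For reference, the output of this command typically contains NAME, ENDPOINTS, and AGE columns similar to the following hedged example (the IP addresses, port, and age shown here are illustrative only):

[root@node1 ~]# kubectl get endpoints -n service-software itom-central-login-svc

NAME                     ENDPOINTS                                  AGE

itom-central-login-svc   177.177.39.10:30000,177.177.59.12:30000   100d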

 

Solution

1.     Access the CLI of the abnormal node and execute the kubectl drain nodeName --ignore-daemonsets --force --delete-local-data --timeout=1800s command to remove all Pods from the abnormal node. The nodeName argument represents the name of the abnormal node.

2.     Execute the kubectl delete node nodeName command to remove the abnormal node. The nodeName argument represents the name of the abnormal node.
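For example, if the abnormal node is named node2 (a hypothetical node name), the two commands are similar to the following:

[root@node1 ~]# kubectl drain node2 --ignore-daemonsets --force --delete-local-data --timeout=1800s

[root@node1 ~]# kubectl delete node node2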

3.     Repair the nodes that are abnormally disconnected. If the server hardware failure cannot be recovered, replace the server.

4.     Log in to the Matrix platform. Click Deploy on the top navigation bar and then select Cluster from the navigation pane. Then, click the set icon  for the target node, and select Rebuild to rebuild the node.

5.     If the issue persists, contact the support.

Containers get stuck in evicted state because of insufficient disk space

Symptom

If the kubectl get pods --all-namespaces command is executed on a node whose disk is full, the output shows that a large number of containers on the node have entered evicted state. These containers remain in this state even after disk space is manually released.

Solution

This symptom might occur if disk space is insufficient on a node. In this case, the K8s eviction mechanism places a large number of containers in evicted state.

To resolve the issue:

1.     Release disk space in the root partition to reduce disk usage.
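For example, you can use the standard df command to view the current usage of the root partition before and after releasing space:

[root@node1 ~]# df -h /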

2.     Log in to the Matrix platform. Click Deploy on the top navigation bar and then select Cluster from the navigation pane. Then, click the set icon  for the target node, and select Repair to repair the node. The K8s mechanism will automatically delete containers in evicted state.

3.     If the issue persists, contact the support.

K8s node gets stuck in Not Ready state after a hugepages size change

Symptom

After you change the hugepages size, for example, changing the value of the GRUB_CMDLINE_LINUX parameter in file /etc/default/grub from "crashkernel=auto rhgb quiet default_hugepagesz=2M hugepagesz=2M hugepages=8192" to "crashkernel=auto rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=16", the K8s node gets stuck in Not Ready state even after the system restarts.

Solution

Kubelet does not support huge pages of different sizes. When you change the size of huge pages, a new hugepages size directory is created in /sys/kernel/mm/hugepages/, where the previous hugepages size directory is still retained. This can cause K8s node state synchronization failure and therefore causes the K8s node to get stuck in Not Ready state.

This issue can be resolved by setting the number of huge pages of one size to 0. This example sets the number of 2M huge pages to 0.

To resolve the issue:

1.     Edit the hugepages configuration file.

a.     Use a vi editor to open the configuration file.

[root@node1 ~]# vi /etc/default/grub

b.     Press i to enter edit mode and configure settings in the file as follows:

GRUB_TIMEOUT=5

GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"

GRUB_DEFAULT=saved

GRUB_DISABLE_SUBMENU=true

GRUB_TERMINAL_OUTPUT="console"

GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=0"

GRUB_DISABLE_RECOVERY="true"

c.     Press ESC to exit edit mode.

d.     Enter :wq and press Enter to save the changes.

2.     Save configuration changes and restart the server for the changes to take effect.

¡     If the server started up in UEFI mode, perform the following tasks:

[root@node1 ~]# grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg

Generating grub configuration file ...

Found linux image: /boot/vmlinuz-3.10.0-862.el7.x86_64

Found initrd image: /boot/initramfs-3.10.0-862.el7.x86_64.img

Found linux image: /boot/vmlinuz-0-rescue-f2e062c5077847ae837b2f1cdb91104f

Found initrd image: /boot/initramfs-0-rescue-f2e062c5077847ae837b2f1cdb91104f.img

Done

[root@node1 ~]# reboot

¡     If the server started up in Legacy mode, perform the following tasks:

[root@node1 ~]# grub2-mkconfig -o /boot/grub2/grub.cfg

Generating grub configuration file ...

Found linux image: /boot/vmlinuz-3.10.0-862.el7.x86_64

Found initrd image: /boot/initramfs-3.10.0-862.el7.x86_64.img

Found linux image: /boot/vmlinuz-0-rescue-f2e062c5077847ae837b2f1cdb91104f

Found initrd image: /boot/initramfs-0-rescue-f2e062c5077847ae837b2f1cdb91104f.img

Done

[root@node1 ~]# reboot

3.     Verify that the number of 2M huge pages has been changed to 0.

[root@node1 ~]# cat /proc/cmdline

BOOT_IMAGE=/vmlinuz-3.10.0-862.el7.x86_64 root=UUID=f47e3128-e888-499e-b370-2b381b6f3134 ro crashkernel=auto rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=0

4.     If the issue persists, contact the support.

Cluster network mode change failure

Symptom

The network mode change for the cluster fails from the Matrix Web interface because of ETCD service exceptions on the master node.

Solution

The system initiates two ETCD requests for a network mode change, one for the change in calico and the other for the change on the Web interface. The Web interface prompts mode change failure if either request fails.

To resolve the issue:

1.     Verify that the ETCD service has recovered. If the service has not recovered, contact the support.

[root@name1 1.0.0]# etcdctl cluster-health

member fb58b3b32bac01c is healthy: got healthy result from http://10.99.212.82:2379

member aa6e53b313aa741f is healthy: got healthy result from http://10.99.212.81:2379

member d1fcbe1f6db25390 is healthy: got healthy result from http://10.99.212.83:2379

2.     If the network mode on the Web interface has been changed but the system prompts failure, change the network mode back to the previous mode. Make sure the mode has been changed back on the Web interface and then proceed to the next step.

3.     Try again to set the target network mode.


Exception on kube-apiserver, kube-scheduler, or kube-controller-manager after Matrix upgrade

Symptom

A node is in red after a Matrix upgrade, and the node details indicate that a kube-apiserver, kubeScheduler, or kubeControllerManager exception has occurred. After you access the CLI of the abnormal node and execute the kubectl get pod -A -owide command, the command output shows that the abnormal node has Pods in CrashLoopBackOff state.

Solution

The solution for the failure varies by situation.

Situation 1

·     Symptom:

When you execute netstat -anlp | grep -w 6443, netstat -anlp | grep -w 10251, or netstat -anlp | grep -w 10252 on the node where the abnormal Pods reside, the corresponding service ports are occupied and in LISTEN state.

·     Cause analysis:

If the old processes do not exit after Matrix upgrade and kube-apiserver port 6443, kube-scheduler port 10251, or kube-controller-manager port 10252 is not released, new Pods cannot start up. You can access the CLI of the node and execute kubectl logs -n kube-system $pod_name or docker logs $container_id to obtain log information about the occupied ports.

·     Solution:

The troubleshooting procedures are similar for the kube-scheduler, kube-apiserver, and kube-controller-manager services. This example troubleshoots the kube-scheduler service as follows:

a.     Remove the kube-scheduler Pod.

[root@name ~]# mv /etc/kubernetes/manifests/kube-scheduler.yaml /opt/

b.     Verify that all kube-scheduler containers have exited. If a container does not exit after a long while, execute the docker rm -f $container_id command to forcibly remove the container, or execute the systemctl restart docker command to restart the docker service.

[root@name ~]# docker ps | grep kube-scheduler

c.     Execute the netstat -anlp | grep -w 10251 command to verify that the port is released. The port is released if no connection in LISTEN state is displayed in the command output.

d.     Start up the kube-scheduler Pod.

[root@name ~]# mv /opt/kube-scheduler.yaml /etc/kubernetes/manifests/

e.     Execute the kubectl get pod -n kube-system -o wide command to obtain the Pod status.

f.     If the issue persists, contact the support.

Situation 2

·     Symptom:

When you execute netstat -anlp | grep -w 6443, netstat -anlp | grep -w 10251, or netstat -anlp | grep -w 10252 on the node where the abnormal Pods reside, the corresponding ports are occupied and only connections in TIME_WAIT state exist. In addition, the ports are not occupied by the kube-apiserver, kube-scheduler, or kube-controller-manager process.

·     Cause analysis:

If the kube-apiserver, kube-scheduler, or kube-controller-manager Pod restarts during Matrix upgrade, port 6443, 10251, or 10252 might be occupied by GlusterFS. In this case, a Pod exception occurs.

·     Solution:

Contact the support.

Error occurrence on Pods including calico-node with error message Delegation not available for unit type

Symptom

After you edit the IP address of a node on the Matrix platform, the node is in red. If you access the CLI of the abnormal node and execute the kubectl get pod -A -owide command, the command output shows that some Pods, including calico-node and calico-kube-controller, are abnormal.

The kubelet logs are as follows:

Error syncing pod 991e112f-c3a3-4c46-9a9b-dfde4ca0a27b ("calico-node-vlpz8_kube-system(991e112f-c3a3-4c46-9a9b-dfde4ca0a27b)"), skipping: failed to ensure that the pod: 991e112f-c3a3-4c46-9a9b-dfde4ca0a27b cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod991e112f-c3a3-4c46-9a9b-dfde4ca0a27b] : Delegation not available for unit type

Solution

This symptom might occur if the containerd version is lower than containerd-v1.3.0. To obtain the containerd version, execute the containerd -v command from the CLI of the node.

To resolve the issue, execute the systemctl restart kubelet.service command on the node where an abnormal Pod resides to restart the kubelet service.
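For example, the following commands check the containerd version and then restart the kubelet service (the node prompt is illustrative):

[root@node1 ~]# containerd -v

[root@node1 ~]# systemctl restart kubelet.service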

 


Troubleshooting cluster upgrade failure

Matrix node upgrade failed and HARBORINSTALL-ERROR occurred

Symptom

Matrix node upgrade failed. The node logs page accessed by clicking the setting icon  for the node displays HARBORINSTALL-ERROR.

Solution

This symptom might occur if a primary/secondary master node switchover occurred during the upgrade.

To resolve the issue:

1.     Log in to the Web interface of the node, and access the Deploy > Cluster page. Click the setting icon  for the master node to view master node log messages and identify the time range from K8SINSTALL-SUCCESS to HARBORINSTALL-ERROR.

Example log messages are as follows:

Fri Nov 06 15:12:28 CST 2020 [INFO]: K8SINSTALL-SUCCESS:

Fri Nov 06 15:12:42 CST 2020 [INFO]: HARBORINSTALL-ERROR:

2.     Access the three servers on which the master nodes reside, and open the /opt/matrix/log/Matrix/system/system.log file. Examine whether the following message was generated during the time range obtained in the previous step:

2020-11-06T12:36:29,857 Matrix Matrix/INFO/NODE_ROLE_CHANGED: -Module="TEAM"; Node 172.131.1.162 changed from LEADER to UNKNOWN

The message indicates that a primary/secondary master node switchover occurred during the upgrade.
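As one possible approach, you can search the log file for the switchover message instead of browsing it manually:

[root@node1 ~]# grep "NODE_ROLE_CHANGED" /opt/matrix/log/Matrix/system/system.log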

3.     Upgrade the node again if a primary/secondary switchover occurred during the failed upgrade.

4.     If the issue persists, contact the support.


Troubleshooting network errors for containers

Network errors caused by network service restart

Symptom

If you restart the network service on a cluster node configured with network settings, the NICs of containers on that node are lost and network connection errors occur.

Solution

This symptom occurs because restarting the network service on a cluster node configured with network settings deletes the VLAN interface and container NIC settings configured at network creation.

To resolve the issue:

1.     Use the reboot command to reboot the node on which the container resides.

2.     If the issue persists, contact the support.


Troubleshooting cluster expansion failures

Cluster deployment failure with successful configuration rollback

Symptom

Cluster deployment fails for a standalone node, but configurations of the node can be rolled back correctly and the Start Deployment button is still available.

Solution

To resolve the issue:

1.     Use one of the following methods to deploy the cluster again:

¡     Access the Deploy > Clusters page, and then click Start Deployment.

¡     Click the set icon  for the failed nodes to delete the nodes. Then, click the plus icon to add the deleted nodes, and then click Start Deployment.

When the deployment progress of each node reaches 100%, the deployment succeeds.

2.     If the issue persists, contact the support.

Failed configuration rollback after cluster deployment failure

Symptom

After cluster deployment fails for a standalone node, configurations of the node cannot be rolled back and the Start Deployment button is unavailable.

Solution

To resolve the issue:

1.     Uninstall the software packages from all the master nodes, including the standalone node and the other two nodes used for cluster deployment.

2.     Re-install the software packages on the standalone node.

3.     Log in to the Matrix platform and restore the cluster and applications.

4.     If the issue persists, contact the support.


Troubleshooting blocked access to Matrix services

The cluster rejects access to any Matrix service if the security policy drops all access requests

Symptom

If you enable a security policy with the default action set to DROP and all rules removed, the cluster rejects all access requests to any Matrix services.

Solution

This symptom occurs because the default action is set to DROP and the default rule that permits access to Matrix port 8443 is removed.

To resolve the issue:

1.     Log in to a master node from the management port.

2.     Access the disaster recovery directory.

[root@node1 ~]# cd /opt/matrix/k8s/disaster-recovery/

3.     Restore the default security policy settings.

[root@node1 ~]# bash recover-security-policies.sh

4.     If the issue persists, contact the support.


Troubleshooting Matrix login failure because of incorrect password

Login failure occurs because of incorrect password

Symptom

An admin user fails to log in to Matrix because of incorrect password.

Solution

To resolve the issue:

1.     Access the script directory on a master node and execute the bash script_name reset_password command to run the password reset script. The script_name argument represents the script name and the reset_password argument represents the new password. The script name varies by matrix operating status. In this example, the new password is Pwd@123456.

¡     If the matrix operates correctly, run the password reset script named resetMatrixUserPassword.sh.

[root@node1 ~]# cd /opt/matrix/k8s

[root@node1 k8s]# bash resetMatrixUserPassword.sh Pwd@123456

+ MATRIX_ADMIN_PASSWD=Pwd@123456

++ curl -k -g -X POST -H Content-Type:application/json -d '{"password": "Pwd@123456"}' https://localhost:8443/matrix/rsapi/v1.0/usermanage/reset_password

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current

                                 Dload  Upload   Total   Spent    Left  Speed

100   110    0    84  100    26    251     77 --:--:-- --:--:-- --:--:--   252

+ return_info='{"token":"3ac4fd9b-35d7-4f66-97b0-2b4ef0a368d1","username":"admin","expireTime":600}'

+ [[ {"token":"3ac4fd9b-35d7-4f66-97b0-2b4ef0a368d1","username":"admin","expireTime":600} =~ admin ]]

+ echo 'Password reset succeeded.'

Password reset succeeded.

¡     If the matrix cannot operate correctly, run the emergency password reset script named resetMatrixUserPassword_emergency.sh.

[root@node1 ~]# cd /opt/matrix/k8s

[root@node1 k8s]# bash resetMatrixUserPassword_emergency.sh Pwd@123456

+ MATRIX_ADMIN_PASSWD=Pwd@123456

++ curl -k -g -X POST -H Content-Type:application/json -H X-Access-Mode:emergency -d '{"password": "Pwd@123456"}' https://localhost:8443/matrix/rsapi/v1.0/usermanage/reset_password

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current

                                 Dload  Upload   Total   Spent    Left  Speed

100   110    0    84  100    26    253     78 --:--:-- --:--:-- --:--:--   253

+ return_info='{"token":"d90753f5-cd2c-4c1c-b178-45cdb18c6261","username":"admin","expireTime":600}'

+ [[ {"token":"d90753f5-cd2c-4c1c-b178-45cdb18c6261","username":"admin","expireTime":600} =~ admin ]]

+ echo 'Password reset succeeded.'

Password reset succeeded.

2.     Log in to Matrix again by using the new password.

3.     If the issue persists, contact the support.


Troubleshooting default route loss

Default routes get lost after the ifconfig command restarts the NIC

Symptom

The configured default routes get lost after you execute the ifconfig command on a node in the cluster to restart the NIC of the node.

Solution

To resolve the issue:

1.     Access the CLI of the node and then use the systemctl restart network command to restart the network service.

[root@node01 ~]# systemctl restart network

2.     Use the route -n command to verify that the default routes have been restored. The example output from this command is as follows:

[root@node01 ~]# route -n

Kernel IP routing table

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface

0.0.0.0         10.99.212.1     0.0.0.0         UG    0      0        0 eth0

10.99.212.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0

169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0

192.168.122.0   0.0.0.0         255.255.255.0   U     0      0        0 virbr0

3.     If the issue persists, contact the support.


Troubleshooting ETCD service failure

ETCD service startup failure

Symptom

ETCD service startup failure occurs because of ETCD storage data file corruption or file loss when a node is power cycled.

·     After a node is power cycled, the ETCD service on it fails to start up because of database file corruption. This event eventually causes the cluster to malfunction.

The following log is displayed:

panic: freepages: failed to get all reachable pages (page 1407374894039040: out of bounds: 1264)

goroutine 116 [running]:

panic(0x55a1d6cce4a0, 0xc420202ef0)

        /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/panic.go:551 +0x3c5 fp=0xc42006bf60 sp=0xc42006bec0 pc=0x55a1d5f0ae25

github.com/coreos/bbolt.(*DB).freepages.func2(0xc42020c180)

...

·     Data restoration failure causes file corruption.

Typically, the snap log file must contain at least one log index value greater than the smallest log index value in the wal log file.

If all log index values in the wal log file are greater than those in the snap log file (as shown in Figure 3), the key operation log data will be lost and the ETCD service cannot recover the data. In this case, file corruption occurs.

Figure 3 Viewing log index values in snap and wal log files

 

·     After a node is power cycled, the ETCD service on it fails to start up because of database snapshot file loss. This event eventually causes the cluster to malfunction.

The following log is displayed:

etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist

 

·     After a node is power cycled, the ETCD service on it fails to start up because of data file corruption. This event eventually causes abnormal node status.

The following log is displayed on the abnormal node:

"error":"walpb: crc mismatch"

Solution

To resolve the issue:

1.     Log in to each node, and execute the systemctl status etcd command to verify that the ETCD service is in running state.

[root@node01 ~]# systemctl status etcd

2.     If the database file for the ETCD service of only one node is corrupted, perform the following steps:

a.     Log in to Matrix.

b.     Click the  icon in the upper right corner of the target node on the Deploy > Clusters > Deploy Cluster page, and then select Rebuild to rebuild the node.

3.     If the database file for the ETCD service is corrupted on two nodes, use the emergency mode to resolve the issue on the two nodes one after another.

4.     If the database file for the ETCD service is corrupted on a standalone node or a three-node cluster, use one of the following methods:

¡     Method 1: Use the procedure in "Solution for a standalone node" or "Solution for a three-node cluster."

¡     Method 2:

-     Uninstall Matrix from all nodes.

-     Reinstall Matrix on all the nodes.

-     Log in to Matrix to restore the cluster and applications by using the backup files. For more information, see backup and restoration in Unified Platform Deployment Guide.

¡     Method 3:

-     Uninstall Matrix from all nodes.

-     Reinstall Matrix on all the nodes.

-     Log in to Matrix to redeploy the cluster and applications.

Solution for a standalone node

1.     Log in to the node. Then, execute the systemctl status etcd command to view the ETCD service status. If the ETCD status is not running, go to the next step.

[root@master1 ~]# systemctl status etcd

2.     Stop the Matrix service on the node.

¡     For a root account, execute the systemctl stop matrix command.

[root@master1 ~]# systemctl stop matrix

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl stop matrix" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop matrix"

3.     Execute the mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix command to stop the kube-apiserver service.

[root@master1 ~]# mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix

4.     Stop the ETCD service.

¡     For a root account, execute the systemctl stop etcd command to stop the ETCD service and then execute the rm -rf /var/lib/etcd/default.etcd/ command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.

[root@master1 ~]# systemctl stop etcd

[root@master1 ~]# rm -rf /var/lib/etcd/default.etcd/

[root@master1 ~]# ll /var/lib/etcd/

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl stop etcd" command to stop the ETCD service and then execute the sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/" command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop etcd"

[admin@node4 ~]$ sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/"

[admin@node4 ~]$ ll /var/lib/etcd/

5.     Access the directory where the ETCD service restoration script resides.

[root@master1 ~]# cd /opt/matrix/k8s/disaster-recovery/

6.     Execute the ETCD service restoration script with a new backup file from the specified ETCD backup directory /opt/matrix/backup/etcd_backup_snapshot/. If the backup file does not exist in the backup directory, an error will be reported.

¡     For a root account, execute the following commands:

[root@master1 ~]# bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805091547.db

2021-08-06 03:16:19.500144 I | mvcc: restore compact to 109069

2021-08-06 03:16:19.506086 I | etcdserver/membership: added member 91651d28c8465c86 [http://10.99.212.125:2380] to cluster db6c09f0e7b9702b

¡     For a non-root account, execute the following commands:

[admin@node4 ~]$ sudo bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805091547.db

2021-08-06 03:16:19.500144 I | mvcc: restore compact to 109069

2021-08-06 03:16:19.506086 I | etcdserver/membership: added member 91651d28c8465c86 [http://10.99.212.125:2380] to cluster db6c09f0e7b9702b

7.     Restart the ETCD service.

¡     For a root account, execute the systemctl restart etcd command.

[root@master1 ~]# systemctl restart etcd

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl restart etcd" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart etcd"

8.     Restart the Matrix service.

¡     For a root account, execute the systemctl restart matrix command.

[root@master1 ~]# systemctl restart matrix

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl restart matrix" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart matrix"

9.     Restore the kube-apiserver service.

[root@master1 ~]# mv /opt/matrix/kube-apiserver.yaml /etc/kubernetes/manifests/

10.     Verify that the issue has been resolved.

a.     Use the VIP to log in to the GUI of the Matrix platform.

b.     On the top navigation bar, click Deploy, and then select Cluster from the left navigation pane to view the master node status.

11.     On the top navigation bar, click Deploy, and then select Pods from the menu to view the running status of the Pods.

Solution for a three-node cluster

1.     Log in to each master node. Then, execute the systemctl status etcd command to view the ETCD service status. If the ETCD status is not running, go to the next step.

[root@master2 ~]# systemctl status etcd

2.     Stop the Matrix service on each master node.

¡     For a root account, execute the systemctl stop matrix command.

[root@master2 ~]# systemctl stop matrix

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl stop matrix" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop matrix"

3.     Execute the mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix command to stop the kube-apiserver service on each master node.

[root@master2 ~]# mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix

4.     Stop the ETCD service on each master node.

¡     For a root account, execute the systemctl stop etcd command to stop the ETCD service and then execute the rm -rf /var/lib/etcd/default.etcd/ command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.

[root@master2 ~]# systemctl stop etcd

[root@master2 ~]# rm -rf /var/lib/etcd/default.etcd/

[root@master2 ~]# ll /var/lib/etcd/

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl stop etcd" command to stop the ETCD service and then execute the sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/" command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop etcd"

[admin@node4 ~]$ sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/"

[admin@node4 ~]$ ll /var/lib/etcd/

5.     Access the directory where the ETCD service restoration script resides.

[root@master1 ~]# cd /opt/matrix/k8s/disaster-recovery/

6.     Execute the ETCD service restoration script with a new backup file from the specified ETCD backup directory /opt/matrix/backup/etcd_backup_snapshot/. If the backup file does not exist in the backup directory, an error will be reported.

For data consistency, make sure all nodes use the same backup file. If a node does not have the file, copy one from another node.

¡     For a root account, execute the following commands:

[root@master2 ~]# bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805091653.db

2021-08-06 06:33:14.788657 I | mvcc: restore compact to 273930

2021-08-06 06:33:14.802137 I | etcdserver/membership: added member 312131d4535cc53f [http://10.99.212.124:2380] to cluster cd6d5adc1bfd16f5

2021-08-06 06:33:14.802189 I | etcdserver/membership: added member 5fc2f82d74297956 [http://10.99.212.123:2380] to cluster cd6d5adc1bfd16f5

2021-08-06 06:33:14.802206 I | etcdserver/membership: added member ad12c65048f444bd [http://10.99.212.120:2380] to cluster cd6d5adc1bfd16f5

¡     For a non-root account, execute the following commands:

[admin@node4 ~]$ sudo bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805014548.db

2021-08-06 01:22:10.876952 I | mvcc: restore compact to 12660679

2021-08-06 01:22:10.906116 I | etcdserver/membership: added member ac2cefc4cae84e25 [http://[2000::100:2000]:2380] to cluster ced7b5d5ee633b40

2021-08-06 01:22:10.906174 I | etcdserver/membership: added member b4689a44b8c1f191 [http://[2000::100:2001]:2380] to cluster ced7b5d5ee633b40

2021-08-06 01:22:10.906197 I | etcdserver/membership: added member c328a554c1ca84f4 [http://[2000::100:2002]:2380] to cluster ced7b5d5ee633b40

7.     Restart the ETCD service on each master node.

¡     For a root account, execute the systemctl restart etcd command.

[root@master2 ~]# systemctl restart etcd

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl restart etcd" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart etcd"

8.     Restart the Matrix service on each master node.

¡     For a root account, execute the systemctl restart matrix command.

[root@master2 ~]# systemctl restart matrix

¡     For a non-root account, execute the sudo /bin/bash -c "systemctl restart matrix" command.

[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart matrix"

9.     Restore the kube-apiserver service on each master node.

[root@master2 ~]# mv /opt/matrix/kube-apiserver.yaml /etc/kubernetes/manifests/

10.     Verify that the issue has been resolved.

a.     Use the VIP to log in to the GUI of the Matrix platform.

b.     On the top navigation bar, click Deploy, and then select Cluster from the left navigation pane to view the master node status.

11.     On the top navigation bar, click Deploy, and then select Pods from the menu to view the running status of the Pods.

12.     If the issue persists, contact the support.

 

ETCD client request times out or primary/standby switchover occurs frequently in the ETCD cluster when ETCD does not have exclusive use of a disk

Symptom

·     An ETCD client, such as a K8s or Matrix client, takes more than 800 ms to access the ETCD database. The following information is displayed in the etcd.log file in the /var/log/matrix-diag/Matrix/etcd directory on each master node.

2020-11-15 12:36:42.013987 W | etcdserver: read-only range request "key:\"/registry/services/specs/default/kubernetes\" " with result "range_response_count:1 size:295" took too long (877.352309ms) to execute

2020-11-15 12:36:54.026221 W | etcdserver: read-only range request "key:\"/registry/pods/base-service/\" range_end:\"/registry/pods/base-service0\" " with result "range_response_count:42 size:107232" took too long (1.767232614s) to execute)

·     Primary/standby switchover occurs frequently in the ETCD cluster, which might be caused by a heartbeat timeout.

Solution

This issue typically results from disk I/O performance degradation in a shared storage environment.

To resolve the issue:

1.     If the symptom occurs during execution of an application installation, upgrade, or configuration deployment task and results in task execution failure, perform the task again to fix the issue.

The system does not remove data synchronized during the execution of an installation, upgrade, or configuration deployment task. Task re-execution is likely to succeed because the amount of data to be synchronized decreases, reducing the impact on disk I/O performance.

2.     If the symptom occurs during normal operation of the system, edit the matrixLeaderLeaseDuration (lease aging time) and matrixLeaderRetryPeriod (lease detection interval) parameters to delay the timeout for primary/backup switchovers.

However, this change will increase the failover time.

3.     If writing failure or data loss occurs because of poor disk I/O performance, use one of the following methods to fix the issue:

¡     Method 1: If a Pod has a status or communication anomaly, use the kubectl delete pod -n namespace podName command to remove the Pod. The system will create a new Pod automatically to restore the ETCD data source.

¡     Method 2: Use the procedure in "Solution for a standalone node" or "Solution for a three-node cluster."

¡     Method 3:

-     Uninstall Matrix from all nodes.

-     Reinstall Matrix on all the nodes.

-     Log in to Matrix to restore the cluster and applications by using the backup files. For more information, see backup and restoration in Unified Platform Deployment Guide.

¡     Method 4:

-     Uninstall Matrix from all nodes.

-     Reinstall Matrix on all the nodes.

-     Log in to Matrix to redeploy the cluster and applications.


Troubleshooting docker service failure

No response after executing docker commands

Symptom

No response is returned for a long time after you execute docker commands such as docker ps, docker images, docker inspect, or docker rmi.

Solution

1.     Restart the docker service.

¡     For the root user, execute the following command to restart the docker service:

[root@master1 ~]# systemctl restart docker

¡     For a non-root user, execute the following command to restart the docker service:

[admin@master1 ~]$ sudo /bin/bash -c "systemctl restart docker"

2.     Verify that the docker service becomes normal.

¡     For the root user, execute the docker images command to view the docker service.

¡     For a non-root user, execute the sudo /bin/bash -c "docker images" command to view the docker service.

The docker service becomes normal if the command output displays the image information of the current node.

 


Troubleshooting failures caused by power cycle of the server or network disconnection

Files in the operating system were lost after power cycle of the server

Symptom

The following issues occurred when the server where Matrix is deployed was power cycled while Matrix was running or during a cluster or application deployment process (for example, during cluster deployment, upgrade, restoration, or rebuilding, or during application deployment or upgrade):

·     The service file or contents in the service file for the docker or chrony service were lost.

·     The configuration files, for example, chrony.conf or hosts or the contents in the files in the /etc/ directory were lost.

·     The log files or contents in the log files in the /var/log directory were lost.

Solution

To resolve the issue:

·     The service file or contents in the service file for the docker or chrony service were lost:

a.     Execute the ls /usr/lib/systemd/system/service-name.service command to identify whether the service file exists on each node or whether no information exists in the file.

b.     If the service file exists on some nodes and contains correct information, execute the scp command to copy the file to the nodes where the file does not exist or does not contain any information.
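For example, to check the docker service file and copy it from a healthy node to the current node (the peer node IP address is illustrative):

[root@node1 ~]# ls /usr/lib/systemd/system/docker.service

[root@node1 ~]# scp root@10.99.212.82:/usr/lib/systemd/system/docker.service /usr/lib/systemd/system/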

c.     If the file does not exist in any node, contact the support or re-install the operating system.

·     The files or contents in the log files in the /etc/ and /var/log directories were lost:

Contact the support or re-install the operating system.

Files that Matrix depends on were lost after power cycle of the server

Symptom

The following issues occurred when the server where Matrix is deployed was power cycled while Matrix was running or during a cluster or application deployment process (for example, during cluster deployment, upgrade, restoration, or rebuilding, or during application deployment or upgrade):

·     The service file or contents in the service file for the etcd or matrix service were lost:

·     The configuration files, for example, navigator_config.json or contents in the files in the /opt/matrix/ directory were lost.

·     The script files, for example, docker.sh, or contents in the files in the /opt/matrix/ directory were lost.

·     The Docker image files in the /var/lib/docker directory were damaged.

·     The YAML file or contents in the YAML file in the /opt/matrix/ directory were lost.

Solution

To resolve the issue:

·     The service file or contents in the service file were lost, or the files or contents in the files in the /opt/matrix/ directory were lost:

a.     Execute the ls command to identify whether the service file exists on each node or whether no information exists in the file.

b.     If the service file exists on some nodes and contains correct information, execute the scp command to copy the file to the nodes where the file does not exist or does not contain any information.

c.     If the file does not exist in any node, contact the support or re-install the operating system.

·     The Docker image files in the /var/lib/docker directory were damaged:

a.     Upload the Matrix package to rebuild the node.

b.     Contact the support.

The node is in red or yellow or some Pods are in CreateContainerError state on the GUI of the Matrix platform after power cycle of the server

Symptom

The following issues occurred when the server where Matrix is deployed was power cycled while Matrix was running or during a cluster or application deployment process (for example, during cluster deployment, upgrade, restoration, or rebuilding, or during application deployment or upgrade):

·     The node is in red or yellow on the Deploy > Clusters page, indicating that Matrix-related Pods are abnormal.

·     Some Pods are in CreateContainerError state on the Monitor > Pods page, indicating that product-related Pods are abnormal.

To view all Pods in CreateContainerError state, access the CLI of any master node and execute the kubectl get pod -A -owide | grep CreateContainerError command.

[root@node1 home]# kubectl get pod -A -owide | grep CreateContainerError

NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE   IP NODE       NOMINATED NODE   READINESS GATES

kube-system   calico-kube-controllers-cd96b6c89-hfz7s   0/1     CreateContainerError   0  29d 10.99.212.164    node1   <none>           <none>

Solution

To resolve the issue, use either of the following methods:

·     Method 1:

a.     Access the node where the abnormal Pod resides and execute the docker ps | grep podname | grep -v POD | grep Up | awk '{print $1}' command to obtain the ID of the container in up state. The podname argument represents the name of the abnormal Pod.

[root@node1 home]# docker ps |grep calico-kube-controllers-cd96b6c89-hfz7s | grep -v POD|grep Up|awk '{print $1}'

c755b7812380

b.     Execute the docker stop containerid && docker rm containerid command, for example, docker stop c755b7812380 && docker rm c755b7812380, to delete the container in up state.

c.     Execute the kubectl get pod -A -owide | grep CreateContainerError command to check for Pods in CreateContainerError state. If such Pods still exist, log in to the GUI of the Matrix platform to rebuild the node.

·     Method 2: Log in to the GUI of the Matrix platform and rebuild the node where the abnormal Pod resides.

The node is in red or yellow or some Pods are in Error state on the GUI of the Matrix platform after power cycle of the server

Symptom

The following issues occurred when the server where Matrix is deployed was power cycled while Matrix was running or during a cluster or application deployment process (for example, during cluster deployment, upgrade, restoration, or rebuilding, or during application deployment or upgrade):

·     The node is in red or yellow on the Deploy > Clusters page, indicating that Matrix-related Pods are abnormal.

·     Some Pods are in Error state on the Monitor > Pods page, indicating that product-related Pods are abnormal.

To view all Pods in Error state, access the CLI of any master node and execute the kubectl get pod -A -owide | grep -w Error command.

You can also access the CLI of the node where the Pod in Error state resides and execute the cat /var/log/matrix-diag/Matrix/kubelet/kubelet.log | grep "unexpected end of JSON input" command to view the kubelet logs about the node. If the following error message is displayed, Pod data is corrupted because of power cycle of the node and the Pod cannot start up.

Multus: failed to load netconf: unexpected end of JSON input

Solution

To resolve the issue, use either of the following methods:

·     Method 1 (applicable to scenarios where a small number of Pods are in Error state):

a.     Access the CLI of any master node and execute the kubectl get pod -A -owide | grep -w Error command to view the namespaces and names of Pods in Error state.

b.     Execute the kubectl delete pod -n namespace podName command to delete a Pod in Error state.

To delete multiple Pods in Error state, execute the command multiple times.

·     Method 2 (applicable to scenarios where a large number of Pods are in Error state):

Access the CLI of any master node and execute the kubectl get pod -A -owide | grep -w Error | awk '{print $1 " " $2}' | xargs -L1 kubectl delete pod -n command. This command deletes all Pods in Error state.

 


Troubleshooting cluster deployment failure

Cluster deployment failed and a K8SINSTALL-ERROR occurred

Symptom

Cluster deployment failed. The node logs page accessed by clicking the setting icon  for the node displays K8SINSTALL-ERROR.

Solution

This issue might occur if the node has multiple NICs in up state.

If a NIC is not configured with an IP address:

By default, the arp_ignore parameter is set to 0 in the operating system, and the system responds to all ARP requests for local IP addresses, including requests for the loopback adapter's address. The Matrix node might encapsulate the MAC address of the NIC that is not configured with an IP address in ARP responses, causing connection errors between cluster nodes.

To resolve the issue:

1.     Use the ip address command on the node to display the IP addresses of NICs and identify the NIC that is not configured with an IP address.

2.     Use the ifconfig nic_name down command to disable the NIC. For example, execute ifconfig eth0 down.

3.     Redeploy the matrix cluster.


Troubleshooting Unified Platform deployment failure

Unified Platform deployment failed because the kubectl exec command failed

Symptom

The system failed to deploy Unified Platform. The log shows that gfs volume creation failed because one node failed to execute the kubectl exec command. If you execute the kubectl exec -it pod bash command on that node, all Pods on the node are inaccessible.

Solution

To resolve the issue:

1.     Log in to the CLI of the node that failed to execute the kubectl exec command.

2.     Execute the systemctl restart kubelet.service command to restart the kubelet service on the node.
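For example (the node prompt is illustrative):

[root@node1 ~]# systemctl restart kubelet.service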

3.     If the issue persists, contact the support.


Troubleshooting cluster deployment failure in an IPv6 environment

Cluster deployment failed in an IPv6 environment when a vNIC was added to a node or an IP address was added to a vNIC

Symptom

The rebuild or upgrade operation fails for a cluster if you add a vNIC to a node or add a new IP address to a vNIC on that node, and the new IP address is not on the same subnet as the IP addresses of the other nodes in the cluster. The system prompts ping failure if you execute the ping6 pod_ip command at the CLI of that node. The pod_ip argument represents the IP address of a container. You can use the kubectl get pod -n kube-system -o wide command to obtain that IP address.
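For example, you can obtain a container IP address and then test reachability from the CLI of that node as follows (pod_ip is the placeholder described above):

[root@node1 ~]# kubectl get pod -n kube-system -o wide

[root@node1 ~]# ping6 pod_ip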

Solution

To resolve the issue:

1.     Change the new IP address to an IP address that is on the same subnet as other nodes.

2.     Configure a routing policy on other nodes so that these nodes can communicate with each other.

3.     If the issue persists, contact the support.


Troubleshooting Unified Platform access failure

Slow request processing caused by ETCD I/O latency

Symptom

Failed to access Unified Platform.

When you review the ETCD log, the following prompt is displayed:

context deadline exceeded, waiting for ReadIndex response took too long, retrying,

When you review the apiserver log, the following prompt is displayed:

stopped listening on [::]:6443

Solution

This symptom might occur when the API server failed to obtain data from ETCD multiple times because of ETCD latency, and then stopped listening to port 6443. In this case, the service components failed to use the K8s API through port 6443.

To resolve the issue:

1.     Verify that the disk I/O performance can meet the requirements. If the average I/O performance value is equal to or greater than 10000, the disk I/O performance meets the requirements. If the average I/O performance value is smaller than 10000, an error has occurred on the disk I/O performance, and you must improve the disk I/O performance.

To test the disk I/O performance:

¡     For the root user, execute the bash /opt/matrix/tools/env_check.sh -p command.

¡     For a non-root user, execute the sudo bash /opt/matrix/tools/env_check.sh -p command.

2.     Execute the kubectl get pod -n service-software | grep stolon-keeper command to obtain the names of all stolon-keeper Pods.

3.     Execute the kubectl delete pod -n service-software pod_name command to restart the stolon-keeper Pods one by one.
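For example, if one of the Pods obtained in the previous step is named stolon-keeper-0 (a hypothetical Pod name):

[root@node1 ~]# kubectl get pod -n service-software | grep stolon-keeper

[root@node1 ~]# kubectl delete pod -n service-software stolon-keeper-0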

4.     Access Unified Platform after all stolon-keeper Pods recover to the running state.


Troubleshooting GlusterFS Pod anomaly after a node reboot or network change

Files with the garbled ??? filename are displayed in the mounted directory on the host or service container and the files are not accessible

Symptom

??? files are displayed when the ls -l command is executed in the mounted directory on the host or service container, and the files are not accessible.

Cause analysis

The remaining disk space is inconsistent across the nodes where the three replicas of the GlusterFS storage volume reside, resulting in data inconsistency between the three replicas after data is written. As a consequence, the GlusterFS storage volume data file enters a split-brain state.

Solution

To resolve the issue:

1.     Execute the kubectl get po -A |grep glusterfs command to view the names and namespaces of the GlusterFS Pods.

2.     Execute the kubectl exec command to access the GlusterFS container, and then execute the gluster volume heal VolumeName info command to check whether the words "Is in split-brain" are displayed in the command output and record the file paths. VolumeName is the name of the faulty storage volume.

3.     Use one of the following methods to resolve the issue:

¡     To resolve the issue based on the file size, execute the gluster volume heal VOLNAME split-brain bigger-file filepath command. VOLNAME is the name of the storage volume. filepath is the full path of the largest file.

¡     To resolve the issue based on the update time of the file, execute the gluster volume heal VOLNAME split-brain latest-mtime filepath command. VOLNAME is the name of the storage volume. filepath is the full path of the most recently generated file.
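For example, for a hypothetical volume named vol1 and a hypothetical split-brain file /data/example.db, the commands executed inside the GlusterFS container accessed in the previous step are similar to the following (the prompt is illustrative):

[root@node1 /]# gluster volume heal vol1 info

[root@node1 /]# gluster volume heal vol1 split-brain latest-mtime /data/example.db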

4.     Use the procedures provided at https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/ to resolve the issue.

5.     If the issue persists, contact H3C Support.
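The following is a hedged sketch of steps 1 through 3, assuming a GlusterFS Pod named glusterfs-l6fcr in namespace glusterfs-example, a volume named vol_demo, and a split-brain file path /example/file.db (all placeholders):

[root@master1 ~]# kubectl get po -A | grep glusterfs
[root@master1 ~]# kubectl exec -it -n glusterfs-example glusterfs-l6fcr -- /bin/bash
# Inside the GlusterFS container, list files in split-brain and record their paths.
gluster volume heal vol_demo info
# Method 1: resolve by keeping the bigger file.
gluster volume heal vol_demo split-brain bigger-file /example/file.db
# Method 2: resolve by keeping the most recently modified file.
gluster volume heal vol_demo split-brain latest-mtime /example/file.db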


Troubleshooting GlusterFS deployment failure after Matrix uninstallation

GlusterFS deployment fails because the disk or disk partition it uses has residual data

Symptom

GlusterFS fails to be deployed. The Matrix platform displays a message that the disk or disk partition GlusterFS uses has residual data.

Device include vg , nodename:node1, device:/dev/vda3

Cause analysis

Heketi for GlusterFS requires an empty disk or disk partition for deployment, but the disk or disk partition specified for GlusterFS deployment contains residual data. You must clean up the disk or disk partition manually.

Solution

To resolve the issue:

1.     Access the directory where the disk cleanup script resides.

[root@m2 ~]# cd /opt/matrix/app/install/metadata/gluster/gluster/scripts/tools/

2.     Execute the bash clearDisk.sh disks command to clean up disks, where disks represents the disks or disk partitions to clean up. Enclose the list in double quotation marks and use spaces to separate multiple disks or disk partitions, for example, bash clearDisk.sh "/dev/vdb /dev/vdc".

[root@m2 ~]# bash clearDisk.sh "/dev/vdb /dev/vdc"

[clear_disk] CAUTION: Please confirm whether to erase the disk /dev/vdb /dev/vdc

Continue anyway? (Y/N) : y

[clear_disk] CAUTION: Please confirm whether to clear glusterfs config file

Continue anyway? (Y/N) : y

[clear_disk] Disk erase complete.

 

CAUTION:

Identify the correct disk or disk partition before you clean it up. The cleanup operation erases all data on the target disk or disk partition.

 

3.     Repeat the preceding procedure on all the other master nodes and then redeploy GlusterFS.

4.     If the issue persists, contact H3C Support.
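To help identify the correct disk or disk partition before cleanup, as the caution above advises, you can list the block devices and LVM physical volumes, for example (device names are placeholders):

[root@m2 ~]# lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT
[root@m2 ~]# pvs | grep /dev/vdb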

Failed to install components that use GlusterFS storage because storage volumes cannot be deleted

Symptom

The installation of a component that uses GlusterFS storage failed. The Matrix logs show that the system failed to delete storage volumes when the installation script called the volume.sh script. When you execute commands to delete the storage volumes on the server where the active master node resides, an error still occurs and the storage volumes cannot be deleted.

Cause analysis

When a component that uses GlusterFS storage is installed, the installation script uses GlusterFS heketi commands to delete and re-create the GlusterFS storage volume. Because of an open-source issue, an error message appears during storage volume deletion, indicating that the storage volume is still mounted in the operating system. However, the operating system mount information does not show any mount for that storage volume, so the storage volume cannot be deleted.
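To verify that the operating system indeed has no mount for the storage volume named in the error message, you can check the mount table on the active master node; the volume name vol_demo is a placeholder:

[root@master1 ~]# mount | grep vol_demo
[root@master1 ~]# grep gluster /proc/mounts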

Solution

To resolve the issue:

1.     Access the CLI of each master node and restart the server where the node resides. As a best practice, restart the servers of the standby master nodes first and then the server of the active master node.

2.     Deploy the components that use GlusterFS storage after the cluster recovers to normal.

Failed to upgrade components that use GlusterFS storage because glusterd exits

Symptom

1.     When the upgrade of a component that uses GlusterFS storage fails, access the CLI of the node and review the Matrix logs. The logs show that the upgrade of the component has failed multiple times.

2.     Access the CLI of a master node, and execute the kubectl get po -A -owide | grep glusterfs command to obtain the names of all the GlusterFS Pods in running state.

[root@matrix ~]# kubectl get po -A -owide | grep   glusterfs

glusterfs-example   glusterfs-l6fcr                1/1     Running   0   3d23h   10.99.212.200 matrix   <none>           <none>

glusterfs-example   heketi-848f8f7dd6-nc2kq                       1/1     Running   0 3d23h   177.177.95.77    matrix   <none>           <none>

glusterfs-example   monitor-84964d7cd7-2wjrr                      1/1     Running   0 3d23h   177.177.95.78    matrix   <none>           <none>

3.     Execute the kubectl exec -it -n glusterfs-example glusterfs-l6fcr /bin/bash command to access the GlusterFS Pod.

4.     Execute the ps -aux | grep /usr/sbin/glusterd | grep -v grep command, but the glusterd process is not displayed in the command output.

Solution

This symptom might occur if the glusterd process in a GlusterFS Pod exits abnormally during the upgrade of components that use GlusterFS storage, which causes storage-related scripts to fail.

To resolve the issue:

1.     Execute the kubectl get po -A -owide | grep glusterfs command to obtain the names of all GlusterFS Pods in running state, for example, glusterfs-l6fcr.

2.     Execute the kubectl exec -it -n glusterfs-example glusterfs-l6fcr /bin/bash command to access containers in the GlusterFS Pod.

3.     Execute the systemctl restart glusterd command to restart the glusterd process.

4.     Execute the ps -aux | grep /usr/sbin/glusterd | grep -v grep command to identify whether the glusterd process has been restarted.

5.     After the glusterd process restarts, upgrade the components that use GlusterFS storage again.
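A condensed sketch of the procedure above, using the Pod name glusterfs-l6fcr and namespace glusterfs-example from the symptom output as examples:

[root@matrix ~]# kubectl get po -A -owide | grep glusterfs
[root@matrix ~]# kubectl exec -it -n glusterfs-example glusterfs-l6fcr -- /bin/bash
# Inside the GlusterFS container, restart glusterd and verify that it is running.
systemctl restart glusterd
ps -aux | grep /usr/sbin/glusterd | grep -v grep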

 


Troubleshooting IP editing failure

Failed to edit the VIPs

Symptom

Failed to edit the VIP settings in the Advance section on the Deploy > Clusters > Cluster Parameters > Modify Cluster Parameters page. The Matrix logs show the following error messages:

2022-02-16T10:33:52,207 | INFO  | DeployResource-11-thread-1 | K8sClientHelper.getConfigMapByName:2120 | [K8sClientHelper] get configmap by name param: namespace kube-system, configmapName kube-proxy

2022-02-16T10:33:52,227 | ERROR | DeployResource-11-thread-1 | DefaultUncaughtExceptionHandler.uncaughtException:18 | uncaught exception in Thread[DeployResource-11-thread-1,5,main], stack: [java.lang.Thread.getStackTrace(Thread.java:1559), com.h3c.matrix.util.DefaultUncaughtExceptionHandler.uncaughtException(DefaultUncaughtExceptionHandler.java:18), java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1057), java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1052), java.lang.Thread.dispatchUncaughtException(Thread.java:1959)]

java.util.ServiceConfigurationError: io.fabric8.kubernetes.api.KubernetesResourceMappingProvider: Provider io.fabric8.kubernetes.internal.InternalResourceMappingProvider not found

Solution

This symptom might occur if Matrix failed to obtain ConfigMaps due to Fabric8 issues.

To resolve the issue, execute the systemctl restart matrix command to restart the Matrix service on the current node, and then edit the VIP settings again.
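For example, for the root user (for a non-root user, prefix the commands with sudo):

[root@master1 ~]# systemctl restart matrix
[root@master1 ~]# systemctl status matrix | grep Active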


Troubleshooting image corruption

Image corruption

Symptom

An image corruption error occurs if one of the following symptoms appears:

·     Symptom 1:

When a Pod is in ImagePullBackOff state and you execute the kubectl describe pod -n namespace podName command to obtain the event log, the command output displays the following:

too many levels of symbolic links

·     Symptom 2:

When a Pod is in ImageInspectError state and you execute the kubectl describe pod -n namespace podName command to obtain the event log, the command output displays the following:

readlink /var/lib/docker/overlay2/l: invalid argument

The namespace argument represents the name space of the Pod, and the podName argument represents the name of the Pod.

Solution

To resolve the issue:

1.     Execute the following commands in sequence to delete all containers and images on the node where the faulty Pod resides:

[root@master1 ~]# systemctl restart docker

[root@master1 ~]# docker system prune

[root@master1 ~]# docker rm -f $(docker ps -aq)

[root@master1 ~]# docker rmi -f $(docker images -q)

2.     If the issue persists, log in to the Matrix platform and rebuild the node where the faulty Pod resides.
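To check whether step 1 resolved the issue before rebuilding the node, you can watch the state of the faulty Pod; namespace and podName are placeholders, as in the symptom description:

[root@master1 ~]# kubectl get pod -n namespace podName -o wide
[root@master1 ~]# kubectl describe pod -n namespace podName | tail -n 20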

 


Troubleshooting failure of PXC Pods because of server power cycle, network disconnection, or switch from standalone mode to cluster mode

Failed to start the PXC database

Symptom

One of the following symptoms might appear:

·     After power cycle of the server, the application services failed to start up. The runtime log indicates that the database connection is abnormal.

·     Failed to log in to the database from the CLI of a master node.

A successful database login is as shown in the following figure:

 

 

Solution

1.     Execute the following command to delete Pods in the database cluster:

kubectl get pod -n service-software -o wide | grep pxc-node | awk '{print $1}' | xargs kubectl -n service-software delete pod

 

 

2.     Execute the kubectl logs -f command to obtain the database container startup log.

¡     In cluster mode, if the log shows all pxc node start up at the end, the database cluster is repaired successfully.

¡     In standalone mode, if the log shows mysql state is Synced, the database is repaired successfully.
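For example, to watch the startup log of a recreated database Pod until one of the success messages above appears (the Pod name pxc-node1-0 is a placeholder):

[root@master1 ~]# kubectl get pod -n service-software | grep pxc-node
[root@master1 ~]# kubectl logs -f -n service-software pxc-node1-0 | grep -E "all pxc node start up|mysql state is Synced"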

 

 

 

Disk file corruption of PXC database

Symptom

In cluster mode, the PXC database failed to operate correctly after a power cycle of the server. In addition, the service Pods associated with the database failed to start up even after the PXC database restarted.

Solution

1.     Execute the kubectl logs -f command to obtain the startup logs of each container for the PXC database and identify the containers whose logs do not display Starting MySQL (Percona XtraDB Cluster) database server.

2.     Execute the following command to stop the corrupted pxc-node containers:

kubectl delete -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node{1/2/3}.yaml

3.     Clear the permanent directories of the corrupted containers. In this example, containers pxc-node1, pxc-node2, and pxc-node3 are associated with Matrix nodes master1, master2, and master3, respectively, and their permanent directories are /var/lib/ssdata/pxc/pxc/1, /var/lib/ssdata/pxc/pxc/2, and /var/lib/ssdata/pxc/pxc/3, respectively. Execute the rm -rf /var/lib/ssdata/pxc/pxc/{1/2/3}/ command to delete the permanent directory of each corrupted container (see the sketch after this procedure).

As a best practice, move the files in a permanent directory to another directory and delete them after the repair succeeds.

4.     Execute the following command to restart the pxc-node containers that have stopped:

kubectl apply -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node{1/2/3}.yaml

5.     Execute the kubectl logs -f command to obtain the database container startup log.

In cluster mode, if the log shows all pxc node start up at the end, the database cluster is repaired successfully.
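A condensed sketch of steps 2 through 4 for a single corrupted container, assuming pxc-node1 hosted on master1 is the corrupted one and using an example backup directory:

[root@master1 ~]# kubectl delete -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node1.yaml
# On master1, move the files in the permanent directory to a backup directory
# instead of deleting them immediately (the backup path is an example).
[root@master1 ~]# mkdir -p /var/lib/ssdata/pxc/pxc-node1-backup
[root@master1 ~]# mv /var/lib/ssdata/pxc/pxc/1/* /var/lib/ssdata/pxc/pxc-node1-backup/
[root@master1 ~]# kubectl apply -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node1.yaml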

 

 

 

PXC database in abnormal state

Symptom

·     Symptom 1:

The services associated with the PXC database can connect to the database but cannot use the database. For example, a service might receive a response like WSREP has not yet prepared node for application use.

·     Symptom 2:

The services associated with the PXC database can connect to the database but cannot use the database. The log might contain a message indicating that a request timed out waiting for a lock to be released and that you can try sending the request again.

·     Symptom 3:

The services associated with the PXC database can connect to the database but cannot use the database. No response is returned from the database.

Cause analysis

·     Cause analysis for symptom 1:

This symptom might occur if a split-brain condition occurs on the database cluster. Typically, the database cluster can recover automatically. You can access the CLI of a master node to identify the state of the database cluster (see the query sketch after this list). If the database cluster has recovered, the query result for the wsrep_local_state_comment, wsrep_ready, and wsrep_incoming_addresses fields is as shown in the following figure:

 

 

If the query result is different, the cluster has not recovered from the split-brain condition. For example:

¡     The value for the wsrep_local_state_comment field might be Initialized or Joining: receiving State Transfer.

¡     The value for the wsrep_ready field might be OFF, which indicates that the current container is not available.

¡     The wsrep_incoming_addresses field does not display IP addresses of all pxc-node containers, which indicates that not all the containers are in the same database cluster.

·     Cause analysis for symptom 2:

This symptom might occur if the database has deadlocks. The deadlocks might be metadata locks or exclusive locks.

·     Cause analysis for symptom 3:

This symptom might occur if data synchronization across the database cluster failed.
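To check the fields described for symptom 1, you can query the wsrep status from inside a pxc-node container; the Pod name pxc-node1-0 and the database credentials are placeholders that depend on your deployment:

[root@master1 ~]# kubectl exec -it -n service-software pxc-node1-0 -- /bin/bash
# Inside the container, query the cluster state (enter the database password when prompted).
mysql -uroot -p -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_local_state_comment','wsrep_ready','wsrep_incoming_addresses');"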

Solution

1.     Execute the following command to delete Pods in the database cluster:

kubectl get pod -n service-software -o wide | grep pxc-node | awk '{print $1}' | xargs kubectl -n service-software delete pod

 

 

2.     Execute the kubectl logs -f command to obtain the database container startup log.

¡     In cluster mode, if the log shows all pxc node start up at the end, the database cluster is repaired successfully.

¡     In standalone mode, if the log shows mysql state is Synced, the database is repaired successfully.

 

 

 
