H3C Unified Platform
Troubleshooting Guide
Document version: 5W102-20230511
Copyright © 2023 New H3C Technologies Co., Ltd. All rights reserved.
No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of New H3C Technologies Co., Ltd.
Except for the trademarks of New H3C Technologies Co., Ltd., any trademarks that may be mentioned in this document are the property of their respective owners.
The information in this document is subject to change without notice.
Contents
Troubleshooting operating systems
Unified Platform unavailable because of cluster node exceptions
Containers get stuck in evicted state because of insufficient disk space
K8s node gets stuck in Not Ready state after a hugepages size change
Cluster network mode change failure
Exception on kube-apiserver, kube-scheduler, or kube-controller-manager after Matrix upgrade
Troubleshooting cluster expansion failures
Standalone-to-cluster scaling failure with successful configuration rollback
Failed configuration rollback after Standalone-to-cluster scaling failure
Troubleshooting blocked access to Matrix services
The cluster rejects access to any Matrix service if the security policy drops all access requests
Troubleshooting Matrix login failure because of incorrect password
Login failure occurs because of incorrect password
Troubleshooting default route loss
Default routes get lost after the ifconfig command restarts the NIC
Troubleshooting ETCD service failure
Troubleshooting docker service failure
No response after executing docker commands
Troubleshooting failures caused by power cycle of the server or network disconnection
Files in the operating system were lost after power cycle of the server
Files that Matrix depends on were lost after power cycle of the server
Troubleshooting cluster deployment failure
Cluster deployment failed and a K8SINSTALL-ERROR occurred
Troubleshooting Unified Platform deployment failure
Unified Platform deployment failed because the kubectl exec command failed
Troubleshooting cluster deployment failure in an IPv6 environment
Troubleshooting Unified Platform access failure
Slow request processing caused by ETCD I/O latency
Troubleshooting GlusterFS Pod anomaly after a node reboot or network change
Troubleshooting GlusterFS deployment failure after Matrix uninstallation or rebuilding
GlusterFS deployment fails because the disk or disk partition it uses has residual data
Failed to install components that use GlusterFS storage because storage volumes cannot be deleted
Failed to upgrade components that use GlusterFS storage because glusterd exits
GlusterFS service anomaly occurred after Matrix rebuilding by using the ISO image
Troubleshooting IP editing failure
Troubleshooting image corruption
Failed to start the PXC database
Disk file corruption of PXC database
PXC database in abnormal state
Loss of all contents of PXC database startup file grastate.dat
Introduction
This document provides information about troubleshooting common software and hardware issues with Unified Platform.
General guidelines
To help identify the cause of issues, collect system and configuration information, including:
· Unified Platform version, Matrix version, and operating system version.
· Symptom, time of failure, and configuration.
· Network topology information, including network diagram, port connections, and points of failure.
· Log messages and diagnostic information. For more information about collecting this information, see "Collecting running logs."
· Steps you have taken and the effect.
Collecting running logs
1. Enter the GUI login address of Unified Platform (in the format of http://ucenter_ip_address:30000/central/index.html) in your browser, and then press Enter.
2. Enter the username and password, and then click Login to enter the GUI home page of Unified Platform.
3. Click System > Log Management > Running Logs to enter the Running Logs page. Then, click the global logs or node logs tab and perform the following tasks:
¡ To view logs in a specific directory within a time span, specify the directory and the start date and end date of the time span.
¡ To search for specific logs, enter the log file or directory name in the search box.
¡ To export running logs to the local device, select the specific log files or click Select All, and then click Export.
Figure 1 Running Logs
Contacting technical support
If an issue persists after you perform the troubleshooting procedures in this document, contact H3C Support. When you contact an authorized H3C support representative, be prepared to provide information described in "General guidelines." This information will help the support engineer assist you as quickly as possible.
You can contact H3C Support at [email protected].
Troubleshooting operating systems
H3Linux installation failure
Symptom
All disks are selected by default for disk partitioning during operating system installation. However, in actual use, you might need to select only some of the disks. For example, if a USB drive is used for operating system installation, you must unselect the disk that corresponds to the USB drive. As shown in Figure 2, when the second disk is unselected or only the first disk is selected, the automatically loaded partitioning scheme does not have the /var/lib/etcd partition.
Figure 2 Unselecting the second and third disks
Solution
This symptom might occur if the second disk is not selected for partitioning during operating system installation. The default partitioning scheme is customized on the premise that all disks are selected. If multiple disks are present, the system partition resides on the first disk and the ETCD partition resides on the second disk. If the second disk is not selected, the /var/lib/etcd partition is missing from the partitioning scheme.
To resolve the issue:
If the /var/lib/etcd partition is missing from the partitioning scheme that is automatically loaded, manually mount the partition as follows:
1. Click the add (+) icon. In the window that opens, specify the /var/lib/etcd mount point, and set the expected capacity to 50 GiB.
2. Click Add Mount Point.
3. Configure the partition settings, and then click OK.
Troubleshooting cluster nodes
Node hardware failure
Symptom
A node in the cluster fails and needs to be replaced.
Solution
If a node in the cluster fails because of a hardware error and cannot be recovered, replace it with a new server.
1. Configure the replacement node server to have the exact same hostname, network card name, node IP address, username, password, RAID mode, and disk partition setting as the original faulty node.
2. Install on the replacement node server the same version of Matrix software as the cluster nodes. For more information, see H3C Unified Platform Deployment Guide.
3. Log in to the Matrix platform. Click Deploy on the top navigation bar and then select Cluster from the navigation pane. Then, click the set icon for the target node and select Repair to repair the node.
4. If the issue persists, contact the support.
Unified Platform unavailable because of cluster node exceptions
Symptom
The following issues are all present:
· Failed to log in to Unified Platform.
· On the Matrix platform, a master node is in red and you cannot ping the abnormal node.
· Pods in running state reside on the abnormal node.
· Execute the kubectl get endpoints -nservice-software itom-central-login-svc command to identify the endpoints for the itom-central-login service. If the Pod IP addresses on the abnormal node remain on these endpoints, an exception occurs.
Figure 3 Identifying endpoints for the itom-central-login service
Solution
1. Access the CLI of the abnormal node and execute the kubectl drain nodeName --ignore-daemonsets --force --delete-local-data --timeout=1800s command to remove all Pods from the abnormal node. The nodeName argument represents the name of the abnormal node.
2. Execute the kubectl delete node nodeName command to remove the abnormal node. The nodeName argument represents the name of the abnormal node.
3. Repair the nodes that are abnormally disconnected. If the server hardware failure cannot be recovered, replace the server.
4. Log in to the Matrix platform. Click Deploy on the top navigation bar and then select Cluster from the navigation pane. Then, click the set icon for the target node and select Rebuild to rebuild the node.
5. If the issue persists, contact the support.
Containers get stuck in evicted state because of insufficient disk space
Symptom
If the kubectl get pods --all-namespaces command is executed on a node whose disk space is full, a large number of containers on the node are shown in evicted state. These containers remain in that state even after disk space is manually released.
Solution
This symptom might occur if the disk space is insufficient on a node. In this case, the K8s mechanism generates a large number of containers in evicted state.
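Before repairing the node, you can confirm both conditions from the CLI of the node. The following is a minimal check sketch; the Evicted pattern simply matches the Pod status column:
[root@node1 ~]# df -h /          # root partition usage; a nearly full partition triggers eviction
[root@node1 ~]# kubectl get pods --all-namespaces | grep -c Evicted          # number of Pods in evicted state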
To resolve the issue:
1. Release disk space in the root partition to reduce disk usage.
2. Log in to the Matrix platform. Click Deploy on the top navigation bar and then select Cluster from the navigation pane. Then, click the set icon for the target node and select Repair to repair the node. The K8s mechanism will automatically delete containers in evicted state.
3. If the issue persists, contact the support.
K8s node gets stuck in Not Ready state after a hugepages size change
Symptom
After you change the hugepages size, for example, changing the value of the GRUB_CMDLINE_LINUX parameter in file /etc/default/grub from "crashkernel=auto rhgb quiet default_hugepagesz=2M hugepagesz=2M hugepages=8192" to "crashkernel=auto rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=16", the K8s node gets stuck in Not Ready state even after the system restarts.
Solution
Kubelet does not support huge pages of different sizes. When you change the size of huge pages, a new hugepages size directory is created in /sys/kernel/mm/hugepages/, where the previous hugepages size directory is still retained. This can cause K8s node state synchronization failure and therefore causes the K8s node to get stuck in Not Ready state.
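To confirm that directories for two hugepages sizes coexist, you can list /sys/kernel/mm/hugepages/ on the node. The directory names below assume the 2M and 1G sizes used in this example:
[root@node1 ~]# ls /sys/kernel/mm/hugepages/
hugepages-1048576kB  hugepages-2048kB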
This issue can be resolved by setting the number of huge pages of one size to 0. The following example sets the number of 2M huge pages to 0.
To resolve the issue:
1. Edit the hugepages configuration file.
a. Use a vi editor to open the configuration file.
[root@node1 ~]# vi /etc/default/grub
b. Press i to enter edit mode and configure settings in the file as follows:
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=0"
GRUB_DISABLE_RECOVERY="true"
c. Press ESC to exit edit mode.
d. Enter :wq and press Enter to save the changes.
2. Save configuration changes and restart the server for the changes to take effect.
¡ If the server started up in UEFI mode, perform the following tasks:
[root@node1 ~]# grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-3.10.0-862.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-862.el7.x86_64.img
Found linux image: /boot/vmlinuz-0-rescue-f2e062c5077847ae837b2f1cdb91104f
Found initrd image: /boot/initramfs-0-rescue-f2e062c5077847ae837b2f1cdb91104f.img
Done
[root@node1 ~]# reboot
¡ If the server started up in Legacy mode, perform the following tasks:
[root@node1 ~]# grub2-mkconfig -o /boot/grub2/grub.cfg
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-3.10.0-862.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-862.el7.x86_64.img
Found linux image: /boot/vmlinuz-0-rescue-f2e062c5077847ae837b2f1cdb91104f
Found initrd image: /boot/initramfs-0-rescue-f2e062c5077847ae837b2f1cdb91104f.img
Done
[root@node1 ~]# reboot
3. Verify that the number of 2M huge pages has been changed to 0.
[root@node1 ~]# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-862.el7.x86_64 root=UUID=f47e3128-e888-499e-b370-2b381b6f3134 ro crashkernel=auto rhgb quiet default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=0
4. If the issue persists, contact the support.
Cluster network mode change failure
Symptom
The network mode change for the cluster fails from the Matrix Web interface because of ETCD service exceptions on the master node.
Solution
The system initiates two ETCD requests for a network mode change, one for the change in calico and the other for the change on the Web interface. The Web interface reports a mode change failure if either request fails.
To resolve the issue:
1. Verify that the ETCD service has recovered. If the service has not recovered, contact the support.
[root@name1 1.0.0]# etcdctl cluster-health
member fb58b3b32bac01c is healthy: got healthy result from http://10.99.212.82:2379
member aa6e53b313aa741f is healthy: got healthy result from http://10.99.212.81:2379
member d1fcbe1f6db25390 is healthy: got healthy result from http://10.99.212.83:2379
2. If the network mode on the Web interface has been changed but the system prompts failure, change the network mode back to the previous mode. Make sure the mode has been changed back on the Web interface and then proceed to the next step.
3. Try again to set the target network mode.
Exception on kube-apiserver, kube-scheduler, or kube-controller-manager after Matrix upgrade
Symptom
A node is in red after a Matrix upgrade and the node details indicate that a kube-apiserver, kubeScheduler, or kubeControllerManager exception has occurred. After you access the CLI of the abnormal node and execute the kubectl get pod -A -owide command, the command output shows that the abnormal node has Pods in CrashLoopBackOff state.
Solution
The solution for the failure varies by situation.
Situation 1
· Symptom:
When you execute netstat -anlp | grep -w 6443, netstat -anlp | grep -w 10251, or netstat -anlp | grep -w 10252 on the node where the abnormal Pods reside, the corresponding service ports are occupied and in LISTEN state.
· Cause analysis:
If the old processes do not exit after Matrix upgrade and kube-apiserver port 6443, kube-scheduler port 10251, or kube-controller-manager port 10252 is not released, new Pods cannot start up. You can access the CLI of the node and execute kubectl logs -n kube-system $pod_name or docker logs $container_id to obtain log information about the occupied ports.
· Solution:
The troubleshooting procedures are similar for the kube-scheduler, kube-apiserver, and kube-controller-manager services. This example troubleshoots the kube-scheduler service as follows:
a. Remove the kube-scheduler Pod.
[root@name ~]# mv /etc/kubernetes/manifests/kube-scheduler.yaml /opt/
b. Verify that all kube-scheduler containers have exited. If a container does not exit after a long while, execute the docker rm -f $container_id command to forcibly remove the container, or execute the systemctl restart docker command to restart the docker service.
[root@name ~]# docker ps | grep kube-scheduler
c. Execute the netstat -anlp | grep -w 10251 command to verify that the port is released. The port is released if no connection in LISTEN state is displayed in the command output.
d. Start up the kube-scheduler Pod.
[root@name ~]# mv /opt/kube-scheduler.yaml /etc/kubernetes/manifests/
e. Execute the kubectl get pod -n kube-system -o wide command to obtain the Pod status.
f. If the issue persists, contact the support.
Situation 2
· Symptom:
When you execute netstat -anlp | grep -w 6443, netstat -anlp | grep -w 10251, or netstat -anlp | grep -w 10252 on the node where the abnormal Pods reside, the corresponding ports are occupied and only connections in TIME_WAIT state exist. In addition, the ports are not occupied by the kube-apiserver, kube-scheduler, or kube-controller-manager process.
· Cause analysis:
If the kube-apiserver, kube-scheduler, or kube-controller-manager Pod restarts during Matrix upgrade, port 6443, 10251, or 10252 might be occupied by GlusterFS. In this case, a Pod exception occurs.
· Solution:
Contact the support.
Error occurrence on Pods including calico-node with error message Delegation not available for unit type
Symptom
After you edit the IP address of a node on the Matrix platform, the node is in red. Access the CLI of the abnormal node and execute the kubectl get pod -A -owide command. The command output shows that some Pods, including calico-node and calico-kube-controller, are abnormal.
The kubelet logs are as follows:
Error syncing pod 991e112f-c3a3-4c46-9a9b-dfde4ca0a27b ("calico-node-vlpz8_kube-system(991e112f-c3a3-4c46-9a9b-dfde4ca0a27b)"), skipping: failed to ensure that the pod: 991e112f-c3a3-4c46-9a9b-dfde4ca0a27b cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod991e112f-c3a3-4c46-9a9b-dfde4ca0a27b] : Delegation not available for unit type
Solution
This symptom might occur if the containerd version is earlier than containerd-v1.3.0. To obtain the containerd version, execute the containerd -v command from the CLI of the node.
To resolve the issue, execute the systemctl restart kubelet.service command on the node where an abnormal Pod resides to restart the kubelet service.
Troubleshooting cluster expansion failures
Standalone-to-cluster scaling failure with successful configuration rollback
Symptom
Standalone-to-cluster scaling fails. The original node is displayed in the correct state on the GUI. The new nodes are displayed in deployment failure state. The Start Deployment button is still available.
Analysis
If standalone-to-cluster scaling fails, the nodes will automatically roll back to standalone mode. If the Start Deployment button is still available, the rollback is successful.
Solution
To resolve the issue:
1. Use either of the following methods to deploy the cluster again:
¡ Log in to Matrix, access the Deploy > Clusters page, and then click Start Deployment.
¡ Log in to Matrix, access the Deploy > Clusters page, and click the set icon for the failed nodes to delete the nodes. Then, click the plus icon to add the deleted nodes, and then click Start Deployment.
When the deployment progress of each node reaches 100%, the deployment succeeds.
2. If the issue persists, contact the support.
Failed configuration rollback after Standalone-to-cluster scaling failure
Symptom
Standalone-to-cluster scaling fails. The original node is displayed in the correct state on the GUI. The new nodes are displayed in deployment failure state. The Start Deployment button is unavailable.
Analysis
If standalone-to-cluster scaling fails, the nodes will automatically roll back to standalone mode. If the Start Deployment button is unavailable, the rollback fails.
Solution
To resolve the issue:
1. Uninstall the Matrix packages from all the master nodes.
2. Re-install the Matrix packages on the original standalone node.
3. Log in to the Matrix platform and restore the cluster and applications.
4. If the issue persists, contact the support.
Troubleshooting blocked access to Matrix services
The cluster rejects access to any Matrix service if the security policy drops all access requests
Symptom
If you enable a security policy with the default action set to DROP and all rules removed, the cluster rejects all access requests to any Matrix services.
Solution
This symptom occurs because the default action is set to DROP and the default rule that permits access to Matrix port 8443 is removed.
To resolve the issue:
1. Log in to a master node from the management port.
2. Access the disaster recovery directory.
[root@node1 ~]# cd /opt/matrix/k8s/disaster-recovery/
3. Restore the default security policy settings.
[root@node1 ~]# bash recover-security-policies.sh
4. If the issue persists, contact the support.
Troubleshooting Matrix login failure because of incorrect password
Login failure occurs because of incorrect password
Symptom
An admin user fails to log in to Matrix because of incorrect password.
Solution
To resolve the issue:
1. Access the script directory on a master node and execute the bash script_name reset_password command to run the password reset script. The script_name argument represents the script name and the reset_password argument represents the new password. The script name varies by the Matrix operating status. In this example, the new password is Pwd@123456.
¡ If Matrix operates correctly, run the password reset script named resetMatrixUserPassword.sh.
[root@node1 ~]# cd /opt/matrix/k8s
[root@node1 k8s]# bash resetMatrixUserPassword.sh Pwd@123456
+ MATRIX_ADMIN_PASSWD=Pwd@123456
++ curl -k -g -X POST -H Content-Type:application/json -d '{"password": "Pwd@123456"}' https://localhost:8443/matrix/rsapi/v1.0/usermanage/reset_password
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 110 0 84 100 26 251 77 --:--:-- --:--:-- --:--:-- 252
+ return_info='{"token":"3ac4fd9b-35d7-4f66-97b0-2b4ef0a368d1","username":"admin","expireTime":600}'
+ [[ {"token":"3ac4fd9b-35d7-4f66-97b0-2b4ef0a368d1","username":"admin","expireTime":600} =~ admin ]]
+ echo 'Password reset succeeded.'
Password reset succeeded.
¡ If Matrix cannot operate correctly, run the emergency password reset script named resetMatrixUserPassword_emergency.sh.
[root@node1 ~]# cd /opt/matrix/k8s
[root@node1 k8s]# bash resetMatrixUserPassword_emergency.sh Pwd@123456
+ MATRIX_ADMIN_PASSWD=Pwd@123456
++ curl -k -g -X POST -H Content-Type:application/json -H X-Access-Mode:emergency -d '{"password": "Pwd@123456"}' https://localhost:8443/matrix/rsapi/v1.0/usermanage/reset_password
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 110 0 84 100 26 253 78 --:--:-- --:--:-- --:--:-- 253
+ return_info='{"token":"d90753f5-cd2c-4c1c-b178-45cdb18c6261","username":"admi n","expireTime":600}'
+ [[ {"token":"d90753f5-cd2c-4c1c-b178-45cdb18c6261","username":"admin","expire Time":600} =~ admin ]]
+ echo 'Password reset succeeded.'
Password reset succeeded.
2. Log in to Matrix again by using the new password.
3. If the issue persists, contact the support.
Troubleshooting default route loss
Default routes get lost after the ifconfig command restarts the NIC
Symptom
The configured default routes get lost after you execute the ifconfig command on a node in the cluster to restart the NIC of the node.
Solution
To resolve the issue:
1. Access the CLI of the node and then use the systemctl restart network command to restart the network service.
[root@node01 ~]# systemctl restart network
2. Use the route -n command to verify that the default routes have been restored. The example output from this command is as follows:
[root@node01 ~]# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 10.99.212.1 0.0.0.0 UG 0 0 0 eth0
10.99.212.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
169.254.0.0 0.0.0.0 255.255.0.0 U 1002 0 0 eth0
192.168.122.0 0.0.0.0 255.255.255.0 U 0 0 0 virbr0
3. If the issue persists, contact the support.
Troubleshooting ETCD service failure
ETCD service startup failure
Symptom
ETCD service startup failure occurs because of ETCD storage data file corruption or file loss when a node is power cycled.
· After a node is power cycled, the ETCD service on it fails to start up because of database file corruption. This event eventually causes the cluster to malfunction.
The following information is displayed in the etcd.log file in the /var/log/matrix-diag/Matrix/etcd/ directory:
panic: freepages: failed to get all reachable pages (page 1407374894039040: out of bounds: 1264)
goroutine 116 [running]:
panic(0x55a1d6cce4a0, 0xc420202ef0)
/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/panic.go:551 +0x3c5 fp=0xc42006bf60 sp=0xc42006bec0 pc=0x55a1d5f0ae25
github.com/coreos/bbolt.(*DB).freepages.func2(0xc42020c180)
...
· Data restoration failure causes file corruption.
Typically, the snap log file in the /var/lib/etcd/default.etcd/member/ directory must contain at least one log index value greater than the smallest log index value in the wal log file.
If all log index values in the wal log file are greater than those in the snap log file (as shown in Figure 4), the key operation log data will get lost and the ETCD service cannot recover data. In this case, file corruption occurs. To view the log index values, list the snap and wal files as shown in the sketch after this list.
Figure 4 Viewing log index values in snap and wal log files
· After a node is power cycled, the ETCD service on it fails to start up because of database snapshot file loss. This event eventually causes the cluster to malfunction.
The following log is displayed:
etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
· After a node is power cycled, the ETCD service on it fails to start up because of data file corruption. This event eventually causes abnormal node status.
The following log is displayed on the abnormal node:
"error":"walpb: crc mismatch"
Solution
To resolve the issue:
1. Log in to each node, and execute the systemctl status etcd command to verify that the ETCD service is in running state.
[root@node01 ~]# systemctl status etcd
2. If the database file for the ETCD service of only one node is corrupted, perform the following steps:
a. Log in to Matrix.
b. Click the icon in the upper right corner of the target node on the Deploy > Clusters > Deploy Cluster page, and then select Rebuild to rebuild the node.
3. If the database file for the ETCD service is corrupted on two nodes, Matrix enters emergency mode. You can resolve the issue by rebuilding the nodes one after another.
4. If the database file for the ETCD service is corrupted on a standalone node or a three-node cluster, use one of the following methods:
¡ Method 1: Use the procedure in "Solution for a standalone node" or "Solution for a three-node cluster."
¡ Method 2:
- Uninstall Matrix from all nodes.
- Reinstall Matrix on all the nodes.
- Log in to Matrix to restore the cluster and applications by using the backup files. For more information, see backup and restoration in Unified Platform Deployment Guide.
Solution for a standalone node
1. Log in to the node. Then, execute the systemctl status etcd command to view the ETCD service status. If the ETCD status is not running, go to the next step.
[root@master1 ~]# systemctl status etcd
2. Stop the Matrix service on the node.
¡ For a root account, execute the systemctl stop matrix command.
[root@master1 ~]# systemctl stop matrix
¡ For a non-root account, execute the sudo /bin/bash -c "systemctl stop matrix" command.
[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop matrix"
3. Execute the mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix command to stop the kube-apiserver service.
[root@master1 ~]# mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix
4. Stop the ETCD service.
¡ For a root account, execute the systemctl stop etcd command to stop the ETCD service and then execute the rm -rf /var/lib/etcd/default.etcd/ command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.
[root@master1 ~]# systemctl stop etcd
[root@master1 ~]# rm -rf /var/lib/etcd/default.etcd/
[root@master1 ~]# ll /var/lib/etcd/
¡ For a non-root account, execute the sudo /bin/bash -c "systemctl stop etcd" command to stop the ETCD service and then execute the sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/" command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.
[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop etcd"
[admin@node4 ~]$ sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/"
[admin@node4 ~]$ ll /var/lib/etcd/
5. Access the directory where the ETCD service restoration script resides.
[root@master1 ~]# cd /opt/matrix/k8s/disaster-recovery/
6. Execute the ETCD service restoration script with the most recent backup file in the ETCD backup directory /opt/matrix/backup/etcd_backup_snapshot/. If the specified backup file does not exist in the backup directory, an error will be reported.
¡ For a root account, execute the following commands:
[root@master1 ~]# bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805091547.db
2021-08-06 03:16:19.500144 I | mvcc: restore compact to 109069
2021-08-06 03:16:19.506086 I | etcdserver/membership: added member 91651d28c8465c86 [http://10.99.212.125:2380] to cluster db6c09f0e7b9702b
¡ For a non-root account, execute the following commands:
[admin@node4 ~]$ sudo bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805091547.db
2021-08-06 03:16:19.500144 I | mvcc: restore compact to 109069
2021-08-06 03:16:19.506086 I | etcdserver/membership: added member 91651d28c8465c86 [http://10.99.212.125:2380] to cluster db6c09f0e7b9702b
7. Restart the ETCD service.
¡ For a root account, execute the systemctl restart etcd command.
[root@master1 ~]# systemctl restart etcd
¡ For a non-root account, execute the sudo /bin/bash -c "systemctl restart etcd" command.
[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart etcd"
8. Restart the Matrix service.
¡ For a root account, execute the systemctl restart matrix command.
[root@master1 ~]# systemctl restart matrix
¡ For a non-root account, execute the sudo /bin/bash -c "systemctl restart matrix" command.
[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart matrix"
9. Restore the kube-apiserver service.
[root@master1 ~]# mv /opt/matrix/kube-apiserver.yaml /etc/kubernetes/manifests/
10. Verify that the issue has been resolved.
a. Use the VIP to log in to the GUI of the Matrix platform.
b. On the top navigation bar, click Deploy, and then select Cluster from the left navigation pane to view the master node status.
11. On the top navigation bar, click Deploy, and then select Pods from the menu to view the running status of the Pods.
Figure 5 All pods in running status
Solution for a three-node cluster
1. Log in to each master node. Then, execute the systemctl status etcd command to view the ETCD service status. If the ETCD status is not running, go to the next step.
[root@master2 ~]# systemctl status etcd
2. Stop the Matrix service on each master node.
¡ For a root account, execute the systemctl stop matrix command.
[root@master2 ~]# systemctl stop matrix
¡ For a non-root account, execute the sudo /bin/bash -c "systemctl stop matrix" command.
[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop matrix"
3. Execute the mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix command to stop the kube-apiserver service on each master node.
[root@master2 ~]# mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/matrix
4. Stop the ETCD service on each master node.
¡ For a root account, execute the systemctl stop etcd command to stop the ETCD service and then execute the rm -rf /var/lib/etcd/default.etcd/ command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.
[root@master2 ~]# systemctl stop etcd
[root@master2 ~]# rm -rf /var/lib/etcd/default.etcd/
[root@master2 ~]# ll /var/lib/etcd/
¡ For a non-root account, execute the sudo /bin/bash -c "systemctl stop etcd" command to stop the ETCD service and then execute the sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/" command to remove the ETCD data directory. Make sure no data directories exist in the /var/lib/etcd directory.
[admin@node4 ~]$ sudo /bin/bash -c "systemctl stop etcd"
[admin@node4 ~]$ sudo /bin/bash -c "rm -rf /var/lib/etcd/default.etcd/"
[admin@node4 ~]$ ll /var/lib/etcd/
5. Access the directory where the ETCD service restoration script resides.
[root@master1 ~]# cd /opt/matrix/k8s/disaster-recovery/
6. Execute the ETCD service restoration script with the most recent backup file in the ETCD backup directory /opt/matrix/backup/etcd_backup_snapshot/. If the specified backup file does not exist in the backup directory, an error will be reported.
For data consistency, make sure all nodes use the same backup file. If a node does not have the file, copy one from another node.
¡ For a root account, execute the following commands:
[root@master2 ~]# bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805091653.db
2021-08-06 06:33:14.788657 I | mvcc: restore compact to 273930
2021-08-06 06:33:14.802137 I | etcdserver/membership: added member 312131d4535cc53f [http://10.99.212.124:2380] to cluster cd6d5adc1bfd16f5
2021-08-06 06:33:14.802189 I | etcdserver/membership: added member 5fc2f82d74297956 [http://10.99.212.123:2380] to cluster cd6d5adc1bfd16f5
2021-08-06 06:33:14.802206 I | etcdserver/membership: added member ad12c65048f444bd [http://10.99.212.120:2380] to cluster cd6d5adc1bfd16f5
¡ For a non-root account, execute the following commands:
[admin@node4 ~]$ sudo bash etcd_restore.sh Etcd_Snapshot_V900R001B06D012_20210805014548.db
2021-08-06 01:22:10.876952 I | mvcc: restore compact to 12660679
2021-08-06 01:22:10.906116 I | etcdserver/membership: added member ac2cefc4cae84e25 [http://[2000::100:2000]:2380] to cluster ced7b5d5ee633b40
2021-08-06 01:22:10.906174 I | etcdserver/membership: added member b4689a44b8c1f191 [http://[2000::100:2001]:2380] to cluster ced7b5d5ee633b40
2021-08-06 01:22:10.906197 I | etcdserver/membership: added member c328a554c1ca84f4 [http://[2000::100:2002]:2380] to cluster ced7b5d5ee633b40
7. Restart the ETCD service on each master node.
¡ For a root account, execute the systemctl restart etcd command.
[root@master2 ~]# systemctl restart etcd
¡ For a non-root account, execute the sudo /bin/bash -c "systemctl restart etcd" command.
[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart etcd"
8. Restart the Matrix service on each master node.
¡ For a root account, execute the systemctl restart matrix command.
[root@master2 ~]# systemctl restart matrix
¡ For a non-root account, execute the sudo /bin/bash -c "systemctl restart matrix" command.
[admin@node4 ~]$ sudo /bin/bash -c "systemctl restart matrix"
9. Restore the kube-apiserver service on each master node.
[root@master2 ~]# mv /opt/matrix/kube-apiserver.yaml /etc/kubernetes/manifests/
10. Verify that the issue has been resolved.
a. Use the VIP to log in to the GUI of the Matrix platform.
b. On the top navigation bar, click Deploy, and then select Cluster from the left navigation pane to view the master node status.
11. On the top navigation bar, click Deploy, and then select Pods from the menu to view the running status of the Pods.
Figure 6 All pods in running status
12. If the issue persists, contact the support.
ETCD client request times out or primary/standby switchover occurs frequently in the ETCD cluster when ETCD does not have exclusive use of a disk
Symptom
· An ETCD client, such as a K8s or Matrix client, takes more than 800 ms to access the ETCD database. The following information is displayed in the etcd.log file in the /var/log/matrix-diag/Matrix/etcd directory on each master node.
2020-11-15 12:36:42.013987 W | etcdserver: read-only range request "key:\"/registry/services/specs/default/kubernetes\" " with result "range_response_count:1 size:295" took too long (877.352309ms) to execute
2020-11-15 12:36:54.026221 W | etcdserver: read-only range request "key:\"/registry/pods/base-service/\" range_end:\"/registry/pods/base-service0\" " with result "range_response_count:42 size:107232" took too long (1.767232614s) to execute)
…
· Primary/standby switchover occurs frequently in the ETCD cluster, which might be caused by a heartbeat timeout.
Solution
This issue typically results from disk I/O performance degradation in a shared storage environment.
To resolve the issue:
1. If the symptom occurs during execution of an application installation, upgrade, or configuration deployment task and results in task execution failure, perform the task again to fix the issue.
The system does not remove data synchronized during the execution of an installation, upgrade, or configuration deployment task. Re-executing the task is likely to succeed because the amount of data to be synchronized decreases, which reduces the impact on disk I/O performance.
2. If the symptom occurs during normal operation of the system, edit the matrixLeaderLeaseDuration (lease aging time) and matrixLeaderRetryPeriod (lease detection interval) parameters to delay the timeout for primary/backup switchovers.
However, this change will increase the failover time.
3. If writing failure or data loss occurs because of poor disk I/O performance, use one of the following methods to fix the issue:
¡ Method 1: If a Pod has a status or communication anomaly, use the kubectl delete pod -n namespace podName command to remove the Pod. The system will automatically create a new Pod to restore the ETCD data source.
¡ Method 2: Use the procedure in "Solution for a standalone node" or "Solution for a three-node cluster."
¡ Method 3:
- Uninstall Matrix from all nodes.
- Reinstall Matrix on all the nodes.
- Log in to Matrix to restore the cluster and applications by using the backup files. For more information, see backup and restoration in Unified Platform Deployment Guide.
Troubleshooting docker service failure
No response after executing docker commands
Symptom
No response is returned for a long time after you execute the docker ps, docker images, docker inspect, or docker rmi command.
Solution
1. Restart the docker service.
¡ For the root user, execute the following command to restart the docker service:
[root@master1 ~]# systemctl restart docker
¡ For a non-root user, execute the following command to restart the docker service:
[admin@master1 ~]$ sudo /bin/bash -c "systemctl restart docker"
2. Verify that the docker service becomes normal.
¡ For the root user, execute the docker images command to view the docker service.
¡ For a non-root user, execute the sudo /bin/bash -c "docker images" command to view the docker service.
The docker service becomes normal if the command output displays the image information of the current node.
Troubleshooting failures caused by power cycle of the server or network disconnection
Files in the operating system were lost after power cycle of the server
Symptom
The following issues occurred when the server where Matrix is deployed was power cycled while Matrix was running or during a cluster/application deployment process (for example, cluster deployment, upgrade, restoration, or rebuilding, or application deployment or upgrade):
· Contents in the chronyd.service, docker.service, and containerd.service files in the /usr/lib/systemd/system directory were lost.
· Contents in the chrony.conf, docker, etcd, hosts, and ssh configuration files in the /etc/ directory were lost. The deployenv.sh file in the /opt/matrix/k8s/ directory was lost.
· The log files or contents in the log files in the /var/log directory were lost.
Solution
To resolve the issue:
· The contents in the chronyd.service, docker.service, and containerd.service files were lost:
a. Execute the ls /usr/lib/systemd/system/service-name.service command on each node to identify whether the service file exists and whether it contains any content.
b. If the service file exists on some nodes and contains correct information, execute the scp command to copy the file to the nodes where the file does not exist or is empty (see the sketch after this list).
c. If the file does not exist in any node, contact the support or re-install the operating system.
· The files or contents in the log files in the /etc/ and /var/log directories were lost:
Contact the support or re-install the operating system.
· The deployenv.sh file in the /opt/matrix/k8s/ directory was lost.
In cluster mode, copy the deployenv.sh file from another master node that has the file. If no master node has the file, rebuild the node. In standalone mode, contact the support or reinstall Matrix.
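A minimal sketch of the copy operation in step b for the first issue (the source node node2 and the chronyd.service file are examples only; after copying a systemd unit file, reload the unit files):
[root@node1 ~]# scp node2:/usr/lib/systemd/system/chronyd.service /usr/lib/systemd/system/
[root@node1 ~]# systemctl daemon-reload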
Files that Matrix depends on were lost after power cycle of the server
Symptom
The following issues occurred when the server where Matrix is deployed was power cycled while Matrix was running or during a cluster/application deployment process (for example, cluster deployment, upgrade, restoration, or rebuilding, or application deployment or upgrade):
· The service file or its contents for the etcd or matrix service were lost.
· The configuration files in the /opt/matrix/ directory (for example, navigator_config.json), or their contents, were lost.
· The script files in the /opt/matrix/ directory (for example, docker.sh), or their contents, were lost.
· The Docker image files in the /var/lib/docker directory were damaged.
Symptom 1: Some pods were in ImagePullBackOff state. The event log for the describe pod is as follows:
error creating overlay mount to /var/lib/docker/overlay2/698028ac124c9d0ef831f7d2d9506acd01faddaae6ea06a0a169fb352e0eddf4/merged: too many levels of symbolic links
Symptom 2: The following message was displayed:
time="2021-05-10T18:05:50.518918884+08:00" level=error msg="Handler for GET /containers/2494c1172314e37bd8250be06a24e0636b7427f89b3b5a5398ecfad7c2fe171d/json returned error: readlink /var/lib/docker/overlay2/l: invalid argument"
· The YAML file or contents in the YAML file in the /opt/matrix/ directory were lost.
Solution
To resolve the issue:
· The service file or its contents were lost, or files in the /opt/matrix/ directory or their contents were lost:
a. Execute the ls command to identify whether the service file exists on each node or whether no information exists in the file.
b. If the service file exists on some nodes and contains correct information, execute the scp command to copy the file to the nodes where the file does not exist or does not contain any information.
c. If the file does not exist in any node, contact the support or re-install Matrix.
· The Docker image files in the /var/lib/docker directory were damaged:
a. Upload the Matrix package to rebuild the node.
b. Contact the support.
The node is in red or yellow or some Pods are in CreateContainerError state on the GUI of the Matrix platform after power cycle of the server
Symptom
The following issues occurred when the server where Matrix is deployed was power cycled while Matrix was running or during a cluster/application deployment process (for example, cluster deployment, upgrade, restoration, or rebuilding, or application deployment or upgrade):
· The node is in red or yellow on the Deploy > Clusters page, indicating that Matrix-related Pods are abnormal.
· Some Pods are in CreateContainerError state on the Monitor > Pods page, indicating that product-related Pods are abnormal.
To view all Pods in CreateContainerError state, access the CLI of any master node and execute the kubectl get pod -A -owide | grep CreateContainerError command.
[root@node1 home]# kubectl get pod -A -owide | grep CreateContainerError
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system calico-kube-controllers-cd96b6c89-hfz7s 0/1 CreateContainerError 0 29d 10.99.212.164 node1 <none> <none>
Solution
To resolve the issue, use either of the following methods:
· Method 1:
a. Access the node where the abnormal Pod resides and execute the docker ps | grep podname | grep -v POD | grep Up | awk '{print $1}' command to obtain the ID of the container in up state. The podname argument represents the name of the abnormal Pod.
[root@node1 home]# docker ps |grep calico-kube-controllers-cd96b6c89-hfz7s | grep -v POD|grep Up|awk '{print $1}'
c755b7812380
b. Execute the docker stop containerid && docker rm containerid command, for example, docker stop c755b7812380 && docker rm c755b7812380, to delete the container in up state.
c. Execute the kubectl get pod -A -owide | grep CreateContainerError command to check for Pods in CreateContainerError state. If such Pods still exist, log in to the GUI of the Matrix platform to rebuild the node.
· Method 2: Log in to the GUI of the Matrix platform and rebuild the node where the abnormal Pod resides.
The node is in red or yellow or some Pods are in Error state on the GUI of the Matrix platform after power cycle of the server
Symptom
The following issues occurred when the server where Matrix is deployed was power cycled while Matrix was running or during a cluster/application deployment process (for example, cluster deployment, upgrade, restoration, or rebuilding, or application deployment or upgrade):
· The node is in red or yellow on the Deploy > Clusters page, indicating that Matrix-related Pods are abnormal.
· Some Pods are in Error state on the Monitor > Pods page, indicating that product-related Pods are abnormal.
To view all Pods in Error state, access the CLI of any master node and execute the kubectl get pod -A -owide | grep -w Error command.
You can also access the CLI of the node where the Pod in Error state resides and execute the cat /var/log/matrix-diag/Matrix/kubelet/kubelet.log | grep "unexpected end of JSON input" command to view the kubelet logs about the node. If the following error message is displayed, Pod data is corrupted because of power cycle of the node and the Pod cannot start up.
Multus: failed to load netconf: unexpected end of JSON input
Solution
To resolve the issue, use either of the following methods:
· Method 1 (applicable to scenarios where a small number of Pods are in Error state):
a. Access the CLI of any master node and execute the kubectl get pod -A -owide | grep -w Error command to view the namespaces and names of Pods in Error state.
b. Execute the kubectl delete pod -n namespace podName command to delete a Pod in Error state.
To delete multiple Pods in Error state, execute the command multiple times.
· Method 2 (applicable to scenarios where a large number of Pods are in Error state):
Access the CLI of any master node and execute the kubectl get pod -A -owide | grep -w Error | awk '{print $1 " " $2}' | xargs kubectl delete pod -n command. This command deletes all Pods in Error state.
The node is in yellow and Grafana is in abnormal health status on the GUI of the Matrix platform after power cycle of the server
Symptom
The following issues occurred when the server where Matrix is deployed was power cycled while Matrix was running or during a cluster/application deployment process (for example, cluster deployment, upgrade, restoration, or rebuilding, or application deployment or upgrade):
· The node is in yellow on the Deploy > Clusters page, and the Grafana health status is displayed abnormal on the node detailed information page.
· Access the CLI of the node in yellow and execute the kubectl get pod -n kube-system -o wide | grep grafana command to view the pod status. The Grafana pod is displayed in CrashLoopBackOff status.
· Access the CLI of the node in yellow and execute the ll /opt/matrix/k8s/conf/grafana command to view the total number of files in the Grafana configuration file folder. The total number is displayed as 0.
Solution
Power cycle of the server caused loss of the Grafana configuration file. To resolve the issue:
1. Execute the sh /opt/matrix/k8s/monitor/preInstallGrafana.sh command to create the Grafana configuration file again.
2. Execute the kubectl delete pod -n kube-system grafana-xxx command to delete the Grafana Pod in CrashLoopBackOff state so that it is re-created.
3. Make sure the configuration file exists in the Grafana configuration folder of all nodes in the cluster.
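A minimal sketch for step 3, assuming a three-node cluster with host names node1 through node3 and password-free SSH between the nodes (adjust the names to your environment); each node should report a non-zero file count:
[root@node1 ~]# for n in node1 node2 node3; do echo -n "$n: "; ssh $n "ls /opt/matrix/k8s/conf/grafana | wc -l"; done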
Troubleshooting cluster deployment failure
Cluster deployment failed and a K8SINSTALL-ERROR occurred
Symptom
Cluster deployment failed. The node logs page accessed by clicking the setting icon for the node displays K8SINSTALL-ERROR.
Solution
This issue might occur if the node has multiple NICs in up state.
If a NIC is not configured with an IP address, the following problem might occur: By default, the arp_ignore parameter is set to 0 in the operating system and the system responds to all ARP requests for local IP addresses, including requests for the loopback adapter's address. The Matrix node might encapsulate the MAC address of the NIC that is not configured with an IP address in ARP responses, causing connection errors between cluster nodes.
To resolve the issue:
1. Use the ifconfig command to view the NIC order when deploying, upgrading, or rebuilding a cluster in a scenario where multiple NICs exist.
2. Make sure all physical NICs before the NIC whose IP is used as the node IP in the Matrix cluster have IPs assigned or the ONBOOT value set to no. For example, if the node uses the IP of NIC ens191 as the node IP in the Matrix cluster, and ens190 is before ens191 in order, make sure ens190 has an IP assigned.
3. Make sure no abnormal NIC configuration file exists in the cluster, for example, a configuration file in which the ONBOOT value is yes for a NIC that is not connected or not configured with an IP address.
4. If the cluster uses the IP of a bonding interface as the node IP in the Matrix cluster, make sure all physical NICs that are not members of the bonding interface have IPs assigned or the ONBOOT value set to no.
5. Reboot the network service.
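A minimal sketch for steps 1 through 3 and step 5, reusing the ens190 example above; the CentOS-style ifcfg path is an assumption for your H3Linux installation:
[root@node1 ~]# ifconfig -a | grep flags          # view the NIC order and state
[root@node1 ~]# grep ONBOOT /etc/sysconfig/network-scripts/ifcfg-ens190
[root@node1 ~]# sed -i 's/^ONBOOT=.*/ONBOOT=no/' /etc/sysconfig/network-scripts/ifcfg-ens190          # only if ens190 has no IP and is not in use
[root@node1 ~]# systemctl restart network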
Troubleshooting Unified Platform deployment failure
Unified Platform deployment failed because the kubectl exec command failed
Symptom
The system failed to deploy Unified Platform. The log shows that gfs volume creation failed because one node failed to execute the kubectl exec command (the command output reports an error). If you execute the kubectl exec -it pod bash command on that node, all Pods on the node are inaccessible.
Solution
To resolve the issue:
1. Log in to the CLI of the node that failed to execute the kubectl exec command.
2. Execute the systemctl restart kubelet.service command to restart the kubelet service on the node.
3. If the issue persists, contact the support.
Troubleshooting cluster deployment failure in an IPv6 environment
Cluster deployment failed in an IPv6 environment when a vNIC was added to a node or an IP address was added to a vNIC
Symptom
The rebuild or upgrade operation for a cluster fails if you add a vNIC to a node or add a new IP address to the vNIC on that node, because the new IP address is not on the same subnet as the IP addresses of the other nodes in the cluster. The system prompts a ping failure if you execute the ping6 pod_ip command at the CLI of that node. The pod_ip argument represents the IP address of the container. You can use the kubectl get pod -n kube-system -o wide command to obtain that IP address.
Solution
To resolve the issue:
1. Change the new IP address to an IP address that is on the same subnet as other nodes.
2. Configure a routing policy on other nodes so that these nodes can communicate with each other.
3. If the issue persists, contact the support.
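A minimal sketch for step 2, assuming the nodes still have Layer 2 connectivity to the new address; the address 2000::100:2003 and the NIC name eth0 are placeholders:
[root@node02 ~]# ip -6 route add 2000::100:2003/128 dev eth0          # treat the off-subnet address as on-link
[root@node02 ~]# ping6 2000::100:2003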
Troubleshooting Unified Platform access failure
Slow request processing caused by ETCD I/O latency
Symptom
Failed to access Unified Platform.
When you review the ETCD log, the following prompt is displayed:
context deadline exceeded, waiting for ReadIndex response took too long, retrying,
When you review the apiserver log, the following prompt is displayed:
stopped listening on [::]:6443
Solution
This symptom might occur when the API server fails to obtain data from ETCD multiple times because of ETCD latency and then stops listening on port 6443. In this case, service components cannot use the K8s API through port 6443.
To resolve the issue:
1. Verify that the disk I/O performance meets the requirements. If the average I/O performance value is 10000 or greater, the disk I/O performance meets the requirements. If the value is smaller than 10000, the disk I/O performance is insufficient and you must improve it.
To test the disk I/O performance:
¡ For the root user, execute the bash /opt/matrix/tools/env_check.sh -p command.
¡ For a non-root user, execute the sudo bash /opt/matrix/tools/env_check.sh -p command.
2. Execute the kubectl get pod -n service-software | grep stolon-keeper command to obtain the names of all stolon-keeper Pods.
3. Execute the kubectl delete pod -n service-software pod_name command to restart the stolon-keeper Pods one by one.
4. Access Unified Platform after all stolon-keeper Pods recover to the running state.
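A minimal sketch for steps 2 through 4; the Pod name stolon-keeper-0 is an example only. Delete the stolon-keeper Pods one at a time and wait for each to return to Running state before deleting the next:
[root@node1 ~]# kubectl get pod -n service-software | grep stolon-keeper
[root@node1 ~]# kubectl delete pod -n service-software stolon-keeper-0
[root@node1 ~]# kubectl get pod -n service-software | grep stolon-keeper          # confirm the Pod is Running again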
Troubleshooting GlusterFS Pod anomaly after a node reboot or network change
Files with the garbled ??? filename are displayed in the mounted directory on the host or service container and the files are not accessible
Symptom
Files named ??? are displayed when the ls -l command is executed in the GFS mounted directory on the host or in the service container, and the files are not accessible.
Cause analysis
The remaining disk space is inconsistent across the nodes where the three replicas of the GlusterFS storage volume reside, resulting in data inconsistency between the three replicas after data writing. As a consequence, the data file of the GlusterFS storage volume enters a split-brain state.
Solution
To resolve the issue:
1. Execute the kubectl get po -A |grep glusterfs command to view the names and namespaces of the GlusterFS Pods.
2. Execute the kubectl exec command to access the GlusterFS container, and then execute the gluster volume heal VolumeName info command to check whether the words Is in split-brain are displayed in the command output and record the file paths. VolumeName is the name of the faulty storage volume. To obtain the storage volume name, execute the kubectl exec -it {gfs pod namespace + gfs pod name} -- gluster volume list | grep {service data volume name} command.
3. Use one of the following methods to resolve the issue:
¡ To resolve the issue based on the file size, execute the gluster volume heal VOLNAME split-brain bigger-file filepath command. VOLNAME is the name of the storage volume. filepath is the full path of the largest file.
¡ To resolve the issue based on the update time of the file, execute the gluster volume heal VOLNAME split-brain latest-mtime filepath command. VOLNAME is the name of the storage volume. filepath is the full path of the most recently generated file.
4. Use the procedures provided at https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/ to resolve the issue.
5. If the issue persists, contact the support.
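A minimal end-to-end sketch of steps 1 through 3, reusing the glusterfs-example namespace and glusterfs-l6fcr Pod name that appear later in this guide; the volume name demo_vol and the file path are placeholders:
[root@node1 ~]# kubectl get po -A | grep glusterfs
[root@node1 ~]# kubectl exec -it -n glusterfs-example glusterfs-l6fcr -- gluster volume heal demo_vol info
[root@node1 ~]# kubectl exec -it -n glusterfs-example glusterfs-l6fcr -- gluster volume heal demo_vol split-brain latest-mtime /path/to/file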
Troubleshooting GlusterFS deployment failure after Matrix uninstallation or rebuilding
GlusterFS deployment fails because the disk or disk partition it uses has residual data
Symptom
GlusterFS fails to be deployed. The Matrix platform displays a message that the disk or disk partition GlusterFS uses has residual data.
Device include vg , nodename:node1, device:/dev/vda3
Cause analysis
Heketi for GlusterFS requires an empty disk or disk partition for deployment, but the disk or disk partition for GlusterFS deployment contained data. You must clean up the disk manually.
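To view the residual LVM metadata that triggers the message, you can check the device before cleaning it up. This is a minimal check sketch that uses the /dev/vda3 device from the example message above:
[root@m2 ~]# pvs | grep vda3          # a physical volume or VG left by a previous deployment indicates residual data
[root@m2 ~]# lsblk /dev/vda3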
Solution
To resolve the issue:
1. Access the directory where the disk cleanup script resides.
[root@m2 ~]# cd /opt/matrix/app/install/metadata/gluster/gluster/scripts/tools/
2. Execute the bash clearDisk.sh disks command to clean up disks, where disks represents the disks to clean up. You must enclose the list of disks in double quotation marks. If you are to clean up multiple disks or disk partitions, use spaces to separate them, for example, bash clearDisk.sh "/dev/vdb /dev/vdc".
[root@m2 ~]# bash clearDisk.sh "/dev/vdb /dev/vdc"
[clear_disk] CAUTION: Please confirm whether to erase the disk /dev/vdb /dev/vdc
Continue anyway? (Y/N) : y
[clear_disk] CAUTION: Please confirm whether to clear glusterfs config file
Continue anyway? (Y/N) : y
[clear_disk] Disk erase complete.
CAUTION: Identify the disk or disk partition before cleaning it up.
3. Repeat the preceding procedure on all the other master nodes and then redeploy GlusterFS.
4. If the issue persists, contact the support.
Failed to install components that use GlusterFS storage because storage volumes cannot be deleted
Symptom
The installation of a component that uses GlusterFS storage failed. The Matrix logs show that the system failed to delete storage volumes when the installation script uses the volume.sh script. When you execute commands to delete storage volumes on the server where the active master node resides, an error still occurs and the storage volumes fail to be deleted.
Cause analysis
To install a component that uses GlusterFS storage, the installation script uses GlusterFS heketi commands to delete and re-create the GlusterFS storage volume. Due to open-source issues, an error message appears during storage volume deletion, indicating that the storage volume is mounted to the operating system. However, the mount information of the operating system does not show the storage volume, and the deletion of the storage volume fails.
Solution
To resolve the issue:
1. Access the CLI of each master node and restart the server where the node resides. As a best practice, restart the standby master nodes first and then the active master node.
2. Deploy the components that use GlusterFS storage again after the cluster recovers to normal (see the verification sketch after this procedure).
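The following commands are a sketch for verifying that the cluster has recovered before you redeploy the components; node names and the GlusterFS namespace are placeholders:
[root@master1 ~]# kubectl get nodes
[root@master1 ~]# kubectl get po -A -owide | grep glusterfs
All nodes must be in Ready state and the GlusterFS, heketi, and monitor Pods must be in Running state before you deploy the components again.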
Failed to upgrade components that use GlusterFS storage because glusterd exits
Symptom
1. The upgrade of a component that uses GlusterFS storage failed. Access the CLI of the node and review the Matrix logs. The logs show that the upgrade of the component has failed multiple times.
2. Access the CLI of a master node, and execute the kubectl get po -A -owide | grep glusterfs command to obtain the names of all the GlusterFS Pods in running state.
[root@matrix ~]# kubectl get po -A -owide | grep glusterfs
glusterfs-example glusterfs-l6fcr 1/1 Running 0 3d23h 10.99.212.200 matrix <none> <none>
glusterfs-example heketi-848f8f7dd6-nc2kq 1/1 Running 0 3d23h 177.177.95.77 matrix <none> <none>
glusterfs-example monitor-84964d7cd7-2wjrr 1/1 Running 0 3d23h 177.177.95.78 matrix <none> <none>
3. Execute the kubectl exec -it -n glusterfs-example glusterfs-l6fcr /bin/bash command to access GlusterFS Pods.
4. Execute the ps -aux | grep /usr/sbin/glusterd | grep -v grep command, but the glusterd process is not displayed in the command output.
Solution
This symptom might occur if the glusterd process in a GlusterFS Pod exits abnormally during the upgrade of components that use GlusterFS storage, which causes the storage-related scripts to fail.
To resolve the issue:
1. Execute the kubectl get po -A -owide | grep glusterfs command to obtain the names of all GlusterFS Pods in running state, for example, glusterfs-l6fcr.
2. Execute the kubectl exec -it -n glusterfs-example glusterfs-l6fcr /bin/bash command to access containers in the GlusterFS Pod.
3. Execute the systemctl restart glusterd command to restart the glusterd process.
4. Execute the ps -aux | grep /usr/sbin/glusterd | grep -v grep command to verify that the glusterd process has restarted (see the example session after this procedure).
5. After the glusterd process restarts, upgrade the components that use GlusterFS storage again.
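The following session is a sketch of steps 2 through 4. The namespace (glusterfs-example) and Pod name (glusterfs-l6fcr) are placeholders, and the ps output is illustrative:
[root@matrix ~]# kubectl exec -it -n glusterfs-example glusterfs-l6fcr /bin/bash
[root@matrix /]# systemctl restart glusterd
[root@matrix /]# ps -aux | grep /usr/sbin/glusterd | grep -v grep
root        95  0.1  0.2 ...  /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
If the glusterd process appears in the output, upgrade the components again.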
GlusterFS service anomaly occurred after Matrix rebuilding by using the ISO image
Symptom
After a node is rebuilt by reinstallation of the ISO image, no GFS-related data can be found in the lsblk command output on the rebuilt node, and operations such as volume creation fail.
Cause analysis
The possible causes for this issue include:
· GlusterFS Heketi requires a blank disk or a blank disk partition, but other data exists in the disk or disk partition used by GlusterFS. As a result, GlusterFS data cannot be synchronized to the rebuilt node.
· The GFS partition information on the rebuilt node is inconsistent with that on the node before rebuilding.
· The glusterd service in the GlusterFS Pod on the rebuilt node is still in an abnormal state when GlusterFS initiates data synchronization.
Solution
1. If other data exists in the disk or disk partition used by GlusterFS, perform the following steps:
a. Enter the directory where the disk cleaning script is stored.
[root@m2 ~]# cd /opt/matrix/app/install/metadata/gluster/gluster/scripts/tools/
b. Execute the bash clearDisk.sh disks command to run the script, where disks represents the disks or disk partitions to clean up. Enclose them in double quotation marks and separate them with spaces, for example, bash clearDisk.sh "/dev/vdb /dev/vdc".
[root@m2 ~]# bash clearDisk.sh "/dev/vdb /dev/vdc"
[clear_disk] CAUTION: Please confirm whether to erase the disk /dev/vdb /dev/vdc
Continue anyway? (Y/N) : y
[clear_disk] CAUTION: Please confirm whether to clear glusterfs config file
Continue anyway? (Y/N) : y
[clear_disk] Disk erase complete.
CAUTION: Clearing a disk is risky. Carefully identify the disk or disk partition you are to clean up.
2. Verify that partition information for GFS after rebuilding is consistent with that before rebuilding.
[root@c1 ~]# cat /opt/matrix/app/install/metadata/gluster/gluster/heketi/config/cluster.json
{
"node" : [ {
"nodename" : "c1",
"device" : [ "/dev/vdc" ]
}, {
"nodename" : "c2",
"device" : [ "/dev/vdc" ]
}, {
"nodename" : "c3",
"device" : [ "/dev/vdc" ]
} ]
}
3. Restart the glusterfs Pod on the rebuilt node. For example, delete the glusterfs Pod on the node with hostname c3. After the Pod is re-created, verify recovery as shown in the sketch at the end of this section.
[root@c3 ~]# kubectl get pod -A -owide |grep glusterfs |grep c3
glusterfs-example gfs-exporter-php62 1/1 Running 0 46m 10.99.212.72 c3 <none> <none>
glusterfs-example glusterfs-fh2cc 1/1 Running 0 46m 10.99.212.72 c3 <none> <none>
glusterfs-example heketi-75d6c7db69-vhzh2 1/1 Running 0 26m 177.177.240.5 c3 <none> <none>
glusterfs-example monitor-5f9bd8ccb4-54mrn 1/1 Running 0 26m 177.177.240.4 c3 <none> <none>
[root@c3 ~]# kubectl delete pod -n glusterfs-example glusterfs-fh2cc
pod "glusterfs-fh2cc" deleted
For more recovery procedures, see the remote replication methods for GlusterFS storage volume data files on the GlusterFS official website at https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-georep/.
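After the Pod is re-created, the following checks are a sketch for confirming that the brick on the rebuilt node is back and that data healing is in progress. The namespace, Pod name, volume name, and device name are placeholders:
[root@c3 ~]# lsblk | grep vdc
[root@c3 ~]# kubectl get pod -n glusterfs-example -owide | grep glusterfs | grep c3
[root@c3 ~]# kubectl exec -it -n glusterfs-example glusterfs-xxxxx -- gluster volume heal vol_example info
The lsblk output must show the GFS-related LVM volumes on the device, the new glusterfs Pod must be in Running state, and the number of entries pending heal must decrease over time.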
Troubleshooting IP editing failure
Failed to edit the VIPs
Symptom
Failed to edit VIP settings in the Advanced section on the Deploy > Clusters > Cluster Parameters > Modify Cluster Parameters page. The Matrix logs show the following error messages:
2022-02-16T10:33:52,207 | INFO | DeployResource-11-thread-1 | K8sClientHelper.getConfigMapByName:2120 | [K8sClientHelper] get configmap by name param: namespace kube-system, configmapName kube-proxy
2022-02-16T10:33:52,227 | ERROR | DeployResource-11-thread-1 | DefaultUncaughtExceptionHandler.uncaughtException:18 | uncaught exception in Thread[DeployResource-11-thread-1,5,main], stack: [java.lang.Thread.getStackTrace(Thread.java:1559), com.h3c.matrix.util.DefaultUncaughtExceptionHandler.uncaughtException(DefaultUncaughtExceptionHandler.java:18), java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1057), java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1052), java.lang.Thread.dispatchUncaughtException(Thread.java:1959)]
java.util.ServiceConfigurationError: io.fabric8.kubernetes.api.KubernetesResourceMappingProvider: Provider io.fabric8.kubernetes.internal.InternalResourceMappingProvider not found
Solution
This symptom might occur if Matrix failed to obtain ConfigMaps due to Fabric8 issues.
To resolve the issue, execute the systemctl restart matrix command to restart the Matrix service on the current node, and then edit the VIP settings again.
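The following session is a sketch of this operation; the status output is illustrative:
[root@master1 ~]# systemctl restart matrix
[root@master1 ~]# systemctl status matrix | grep Active
   Active: active (running) since ...
After the Matrix service is active again, edit the VIP settings on the Matrix page.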
Troubleshooting image corruption
Image corruption
Symptom
An image corruption error occurs if one of the following symptoms appears:
· Symptom 1:
When a Pod is in ImagePullBackOff state and you execute the kubectl describe pod -n namespace podName command to obtain the event log, the command output displays the following:
too many levels of symbolic links
· Symptom 2:
When a Pod is in ImageInspectError state and you execute the kubectl describe pod -n namespace podName command to obtain the event log, the command output displays the following:
readlink /var/lib/docker/overlay2/l: invalid argument
The namespace argument represents the name space of the Pod, and the podName argument represents the name of the Pod.
Solution
To resolve the issue:
1. Execute the following commands in sequence to delete all containers and images on the node where the faulty Pod resides:
[root@master1 ~]# systemctl restart docker
[root@master1 ~]# docker system prune
[root@master1 ~]# docker rm -f $(docker ps -aq)
[root@master1 ~]# docker rmi -f $(docker images -q)
2. If the issue persists after the images are re-pulled (see the verification sketch after this procedure), log in to the Matrix platform and rebuild the node where the faulty Pod resides.
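After the containers and images are cleared, Kubernetes re-pulls the images automatically when the Pods are rescheduled. The following check is a sketch; namespace and podName are the same placeholders used in the symptom description:
[root@master1 ~]# kubectl get pod -n namespace podName
[root@master1 ~]# kubectl describe pod -n namespace podName | grep -A 10 Events
The Pod is recovered if it returns to Running state and the events show successful image pulls instead of the symbolic link or readlink errors.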
Troubleshooting failure of PXC Pods because of server power cycle, network disconnection, or switch from standalone mode to cluster mode
Failed to start the PXC database
Symptom
One of the following symptoms might appear:
· After power cycle of the server, the application services failed to start up. The runtime logs indicate that the database connection is abnormal.
· Failed to log in to the database from the CLI of a master node.
A successful database login is as shown in the following figure:
Solution
1. Execute the following command to delete Pods in the database cluster:
kubectl get pod -n service-software -o wide | grep pxc-node | awk '{print $1}' | xargs kubectl -n service-software delete pod
2. Execute the kubectl logs -f command to obtain the database container startup log (see the example after this procedure).
¡ In cluster mode, if the log shows all pxc node start up at the end, the database cluster is repaired successfully.
¡ In standalone mode, if the log shows mysql state is Synced, the database is repaired successfully.
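The following session is a sketch of steps 1 and 2; the pxc-node Pod name is a placeholder obtained from the kubectl get pod output:
[root@master1 ~]# kubectl get pod -n service-software | grep pxc-node
[root@master1 ~]# kubectl logs -f -n service-software pxc-node1-xxxx
Watch the log output for the all pxc node start up message (cluster mode) or the mysql state is Synced message (standalone mode).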
Disk file corruption of PXC database
Symptom
In cluster mode, the PXC database failed to operate correctly after the server was power cycled. In addition, the service Pods associated with the database failed to start up even after the PXC database restarted.
Solution
1. Execute the kubectl logs -f command to obtain the startup logs of each container for the PXC database, and identify the containers whose logs do not display the Starting MySQL (Percona XtraDB Cluster) database server message.
2. Execute the following command to stop starting the corrupted pxc-node containers:
kubectl delete -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node{1/2/3}.yaml
3. Clear the permanent directories of the corrupted containers. In this example, containers pxc-node1, pxc-node2, and pxc-node3 are associated with Matrix nodes master1, master2, and master3, respectively. The permanent directories for containers pxc-node1, pxc-node2, and pxc-node3 are /var/lib/ssdata/pxc/pxc/{1}, /var/lib/ssdata/pxc/pxc/{2}, and /var/lib/ssdata/pxc/pxc/{3}, respectively. Execute the rm -rf /var/lib/ssdata/pxc/pxc/{1/2/3}/ command to delete the corresponding permanent directory.
As a best practice, move the files in a permanent directory to another directory first, and delete them only after the repair succeeds (see the sketch after this procedure).
4. Execute the following command to restart the pxc-node containers that have stopped:
kubectl apply -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node{1/2/3}.yaml
5. Execute the kubectl logs -f command to obtain the database container startup log.
In cluster mode, if the log shows all pxc node start up at the end, the database cluster is repaired successfully.
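The following commands are a sketch of the backup recommended in step 3, using pxc-node1 on node master1 as an example and assuming its permanent directory is /var/lib/ssdata/pxc/pxc/1 as described in step 3; the backup directory is arbitrary:
[root@master1 ~]# mkdir -p /var/lib/ssdata/pxc/pxc-node1-backup
[root@master1 ~]# mv /var/lib/ssdata/pxc/pxc/1/* /var/lib/ssdata/pxc/pxc-node1-backup/
Delete the backup directory only after you confirm that the database cluster has been repaired.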
PXC database in abnormal state
Symptom
· Symptom 1:
The services associated with the PXC database can connect to the database but cannot use the database. For example, a service might receive a response like WSREP has not yet prepared node for application use.
· Symptom 2:
The services associated with the PXC database can connect to the database but cannot use it. The logs show a response indicating that a request timed out while waiting for a lock to be released and that the request can be retried.
· Symptom 3:
The services associated with the PXC database can connect to the database but cannot use the database. No response is returned from the database.
Cause analysis
· Cause analysis for symptom 1:
This symptom might occur if a split-brain condition occurs on the database cluster. Typically, the database cluster can recover automatically. You can access the CLI of a master node to identify the state of the database cluster (see the query sketch after this cause analysis). If the database cluster has recovered, the query results for the wsrep_local_state_comment, wsrep_ready, and wsrep_incoming_addresses fields are as shown in the following figure:
If the query results are different, the cluster has not recovered from the split-brain condition. For example:
¡ The value for the wsrep_local_state_comment field might be Initialized or Joining: receiving State Transfer.
¡ The value for the wsrep_ready field might be OFF, which indicates that the current container is not available.
¡ The wsrep_incoming_addresses field does not display IP addresses of all pxc-node containers, which indicates that not all the containers are in the same database cluster.
· Cause analysis for symptom 2:
This symptom might occur if the database has deadlocks. The deadlocks might be metadata locks or exclusive locks.
· Cause analysis for symptom 3:
This symptom might occur if data synchronization across the database cluster failed.
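To check the fields mentioned in the cause analysis for symptom 1, you can query the Galera status variables from inside a pxc-node container. The following command is a sketch; the Pod name and database credentials are placeholders:
[root@master1 ~]# kubectl exec -it -n service-software pxc-node1-xxxx -- mysql -uroot -p -e "SHOW GLOBAL STATUS LIKE 'wsrep%';"
In a healthy cluster, wsrep_local_state_comment is Synced, wsrep_ready is ON, and wsrep_incoming_addresses lists the IP addresses of all pxc-node containers.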
Solution
1. Execute the following command to delete Pods in the database cluster:
kubectl get pod -n service-software -o wide | grep pxc-node | awk '{print $1}' | xargs kubectl -n service-software delete pod
2. Execute the kubectl logs -f command to obtain the database container startup log.
¡ In cluster mode, if the log shows all pxc node start up at the end, the database cluster is repaired successfully.
¡ In standalone mode, if the log shows mysql state is Synced, the database is repaired successfully.
Loss of all contents of PXC database startup file grastate.dat
Symptom
In standalone mode, the PXC database fails to start up after the server shuts down unexpectedly and then restarts. As a result, Pods dependent on the PXC database fail to start. The issue persists after the PXC database Pod is restarted.
· View the grastate.dat file of the PXC database from the CLI of the node. The file is empty and its content is lost.
· The database cannot be connected.
· The pods that use the database service are abnormal.
· The following error messages are displayed for the PXC Pod:
Solution
1. Execute the following command to stop container pxc-node1 that failed to start up.
kubectl delete -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node1.yaml
2. Execute the vim grastate.dat command in the directory where the grastate.dat file resides, add the following contents to the file, and then save the file.
# GALERA saved state
version: 2.1
uuid: 2013b697-a063-11ed-b00e-d340082886cf
seqno: -1
safe_to_bootstrap: 1
Figure 7 Viewing the grastate.dat file content
3. Execute the following command to restart the stopped container pxc-node1.
kubectl apply -f /opt/matrix/app/install/metadata/UCENTER/portal/portal/common/k8s-resources/pxc-node1.yaml
4. Execute the kubectl logs -f command to view the startup logs of the database container. The database has been repaired if the mysql state is Synced message is printed.
5. Verify that the database can be connected to correctly (see the example check after this procedure).
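The following command is a sketch of the connectivity check in step 5; the Pod name and database credentials are placeholders:
[root@master1 ~]# kubectl exec -it -n service-software pxc-node1-xxxx -- mysqladmin -uroot -p ping
mysqld is alive
The database is reachable if the mysqld is alive message is returned.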