H3C Servers NVMe Drive Online Replacement User Guide-6W100


 

H3C Servers

NVMe Drive Online Replacement User Guide

Version: 6W100-20220107

 

Copyright © 2022 New H3C Technologies Co., Ltd. All rights reserved.

No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of New H3C Technologies Co., Ltd.

Except for the trademarks of New H3C Technologies Co., Ltd., any trademarks that may be mentioned in this document are the property of their respective owners.

The information in this document is subject to change without notice.



About NVMe drives

H3C servers support the following types of NVMe drives: NVMe U.2 SSD, NVMe M.2 SSD, and PCIe accelerator module. This document describes online replacement of NVMe U.2 SSDs (hereinafter referred to as NVMe drives) because the other two types do not support online replacement.

NVMe drives support the following online replacement methods:

·     Hot swapping: Allows users to directly remove an old drive and insert a new one while the operating system is running correctly.

·     Managed hot swapping: Requires users to first execute a command so the operating system completes a hot removal process before they physically remove the drive and insert a new one.

An NVMe drive is faulty if one of the following conditions exists:

·     The drive Fault/UID LED is steady amber.

·     The HDM Web page displays that the drive is in an abnormal status.

·     HDM logs indicate that the drive has failed.

·     Operating system logs indicate that the drive has failed.

For more failure cases and symptoms of NVMe drives, see H3C Servers Troubleshooting Guide.

Applicable products

This document is applicable to the following products:

·     For G5 models:

¡     H3C UniServer R4300 G5

¡     H3C UniServer R4700 G5

¡     H3C UniServer R4900 G5

¡     H3C UniServer R6900 G5

·     For G3 models:

¡     H3C UniServer R2700 G3

¡     H3C UniServer R2900 G3

¡     H3C UniServer R4300 G3

¡     H3C UniServer R4700 G3

¡     H3C UniServer R4900 G3

¡     H3C UniServer R6700 G3

¡     H3C UniServer R6900 G3

¡     H3C UniServer R8900 G3


Before you begin

Before replacing an NVMe drive when the server is operating, prepare the software and hardware for the replacement and identify the online replacement method.

Preparing the hardware

Prepare the hardware as follows:

·     To avoid data loss, stop the NVMe drive services and back up data.

·     Go to Query tool for operating system compatibility to get information about drives supported by the server, and make sure the server is compatible with the drive.

·     For information about the installation location of the drive, see the product user guide.

Preparing the software

Prepare the software as follows:

·     Make sure the BIOS and HDM software versions meet the following requirements:

¡     For G5 servers:

-     BIOS: Version BIOS-5.06 or later.

-     HDM: Version HDM-2.13 or later.

¡     For G3 servers:

-     BIOS: Version BIOS-2.00.39 or later.

-     HDM: Version HDM-1.30.18 or later.

·     Make sure VMD is enabled. For more information about VMD, see the BIOS user guide for the server. To perform a managed hot removal of an NVMe drive when VMD is disabled, contact Technical Support.

·     Make sure the number of member drives to be removed from a RAID setup does not exceed the maximum allowed number of failed drives as described in Table 1.

Table 1 Number of hot-swappable drives from a RAID setup

RAID level    Required drives    Max. failed drives
RAID 0        ≥ 2                0
RAID 1        2                  1
RAID 5        ≥ 3                1
RAID 10       4                  2

NOTE:

For RAID 10, make sure the two failed drives are in different RAID 1 setups.

 

Identifying the replacement method

Visit Query tool for operating system compatibility, click the tab for NVMe drive and OS compatibility, and query the online replacement methods supported by the current server configuration (processor platform, VMD status, and operating system). If the server supports both methods, use either method as needed. If the server supports neither method, contact Technical Support.

 


Hot swapping of an NVMe drive

This chapter describes hot removal and hot insertion procedures.

Hot swapping flowchart

Figure 21 Hot swapping flowchart

 


Performing hot removal

Performing hot removal in Windows

Prerequisites

Before replacing an NVMe drive in Windows, make sure the Intel® VROC driver version is the same as or later than the VROC PreOS version in the BIOS. You can obtain the latest Intel® VROC driver from the H3C official website.

To query the VROC driver version and VROC PreOS version:

1.     Obtain and open Intel® Virtual RAID on CPU, and click the question icon on the navigation bar, as shown in Figure 22.

 

IMPORTANT:

To obtain Intel® Virtual RAID on CPU, use one of the following methods:

·     Go to the H3C official website and download the software.

·     Use an Intel authorized account to log in to the Intel official website and download the software.

The tool user guide is included in the installation package. You can refer to the guide to install and use the tool.

 

Figure 22 Clicking the question icon on the navigation bar

 

2.     Click System Report on the navigation bar, as shown by the callout 1 in Figure 23.

a.     View the version information of VROC PreOS, as shown by the callout 2 in Figure 23.

b.     View the version information of Intel® VROC, as shown by the callout 3 in Figure 23.

Figure 23 Viewing VROC PreOS version information

 

Procedure

1.     Stop services on the NVMe drive.

2.     Back up the drive data.

3.     Run Intel® Virtual RAID on CPU to view NVMe drives.

Figure 24 Viewing NVMe drives

 

4.     Select the NVMe drive to be removed from the Devices list and identify the drive location.

This example removes the NVMe drive from Controller 0, Port1.

Figure 25 Identifying the NVMe drive location

 

5.     (Optional.) If the NVMe drive to be removed is a member drive in a RAID setup configured with hot spares, view the RAID rebuild status.

¡     If RAID rebuild is in progress, wait for the process to finish.

 

CAUTION:

To avoid data loss, do not perform any operations on the NVMe drive during RAID rebuild.

 

Figure 26 RAID rebuild in progress

 

¡     If RAID rebuild is complete (hot spares become the member drives), go to step 6.

Figure 27 RAID rebuild completed

 

6.     Click Activate LED for the drive. The Fault/UID LED on the physical drive will turn steady blue for 10 seconds and then turn off automatically. The Present/Active LED will be steady green.

Figure 28 Activating the LEDs for the NVMe drive

 

7.     Remove the NVMe drive. The removal method is similar for different server models.

 

CAUTION:

·     Do not swap drives frequently. If you swap a drive repeatedly within 30 seconds, the system might fail to identify the drive.

·     To avoid system errors, remove only one drive at a time. To remove multiple drives, uninstall them one after another, and remove one drive only after you completely remove the previous one.

 

a.     Press the button on the drive panel to release the locking lever, as shown by the callout 1 in Figure 29.

b.     Pull the drive out of the drive cage, as shown by the callout 2 in Figure 29.

Figure 29 Removing the drive

 

c.     Remove the drive from the carrier. Remove all screws that secure the drive to the carrier, and then take the drive off the carrier.

Figure 30 Removing the drive from the carrier

 

d.     Place the removed drive into an antistatic bag.

 

CAUTION:

Do not place two components into one antistatic bag.

 

Performing hot removal in Linux

1.     Execute the lsblk | grep nvme command to identify the drive letter of the NVMe drive to be removed.

This procedure uses drive nvme2n1 as an example.

Figure 31 Identifying the drive letter of the NVMe drive to be removed

 

2.     Stop services on the NVMe drive.

3.     Back up the drive data.

4.     Identify the type of the drive to be removed:

¡     For a passthrough drive, proceed to step 5.

¡     For a member drive in a RAID setup configured with hot spares, proceed to step 6.

¡     For a member drive in a RAID setup that is not configured with hot spares, proceed to step 7.

5.     If the NVMe drive is a passthrough drive:

a.     Execute the df -h command to identify the mounting status of the NVMe drive. As shown in Figure 32, drive nvme2n1 has been mounted.

Figure 32 Viewing the mounting status of the NVMe drive

 

b.     Execute the umount /dev/nvme2n1 command to unmount the drive.

Figure 33 Unmounting the NVMe drive

 

c.     Execute the df -h command again and verify that the drive has been unmounted.

Figure 34 Verifying that the drive has been unmounted

 

d.     Proceed to step 8.

6.     If the NVMe drive is in a RAID setup with hot spares configured:

a.     Execute the df -h command to identify the mounting status of the NVMe drive. As shown in Figure 35, drive md0p1 has been mounted.

Figure 35 Viewing the mounting status of the NVMe drive

 

b.     Execute the umount /dev/md0p1 command to unmount the drive.

Figure 36 Unmounting the NVMe drive

 

c.     Execute the df -h command again and verify that the drive has been unmounted.

Figure 37 Verifying that the drive has been unmounted

 

d.     Execute the cat /proc/mdstat command to view the RAID rebuild status.

 

CAUTION:

To avoid data loss, do not perform any operations on the NVMe drive during RAID rebuild.

 

-     If RAID rebuild is in progress as shown in Figure 38, wait for the process to finish.

-     If RAID rebuild is complete (hot spares become the member drives) as shown in Figure 39, proceed to the next step.

Record the name of the container in which the drive resides before proceeding to the next step. In this example, the container is /dev/md127.

Figure 38 RAID rebuild in progress

 

Figure 39 RAID rebuild completed

 

e.     Execute the mdadm -r /dev/md127 /dev/nvme2n1 command to remove drive nvme2n1 from the container, as shown in Figure 40.

Figure 40 Removing the NVMe drive from the container

 

f.     Execute the cat /proc/mdstat command and verify that the drive has been removed successfully. As shown in Figure 41, the drive has been removed from the container.

Figure 41 Verifying the drive removal

 

g.     Go to step 8.

7.     If the NVMe drive is in a RAID setup without hot spares configured, perform the following tasks:

a.     Verify the NVMe drive mounting status. If the drive has been mounted, unmount it. For more information, see step 5.

b.     Proceed to the next step.

8.     Identify the location of the NVMe drive on the server.

a.     Execute the find /sys/devices -iname nvme2n1 command to identify the bus number of the drive. As shown in Figure 42, the bus number for the drive is 10000:04:00.0.

Figure 42 Identifying the bus number

 

b.     Execute the lspci -vvs 10000:04:00.0 command to identify the PCIe slot number. As shown in Figure 43, the PCIe slot is 109.

Figure 43 Identifying the PCIe slot number

 

c.     Identify the physical slot number of the drive. Log in to HDM, access the Storage > Physical View page, and identify the physical slot number in HDM for the PCIe slot number. In this example, the physical slot in HDM is Front slot 9.

Items on the HDM Web page might vary by HDM version, but the procedure for viewing NVMe drive information is similar.

Figure 44 Identifying the physical slot number in HDM

 

d.     Use Table 2 to identify the drive slot number on the server. In this example, Front slot 9 in HDM represents slot 9 on the server, as shown in Figure 45.

Table 2 Drive slot number mappings

 

Figure 45 Drive slot numbering on a server

 

9.     Remove the NVMe drive. For more information, see "Remove the NVMe drive."
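The command-line portion of the procedure above can be sketched as a shell snippet. The drive name (nvme2n1), container (/dev/md127), and bus number are this guide's example values and will differ on a real server; each tool is guarded so the sketch is harmless to run on a machine without the drive.

```shell
# Hedged sketch of the Linux hot-removal steps; all identifiers are the
# example values from this guide, not real device names.
DRIVE=nvme2n1
CONTAINER=/dev/md127
BUS=10000:04:00.0

# Step 5: unmount the drive if any filesystem is mounted from it.
if df -h 2>/dev/null | grep -q "$DRIVE"; then
    umount "/dev/${DRIVE}"
fi

# Step 6: for a RAID member with hot spares, wait for the rebuild to finish,
# then drop the drive from its container.
if command -v mdadm >/dev/null 2>&1; then
    while grep -q recovery /proc/mdstat 2>/dev/null; do sleep 5; done
    mdadm -r "$CONTAINER" "/dev/${DRIVE}" 2>/dev/null
fi

# Step 8: locate the drive by finding its PCIe bus number, then read the
# physical slot number from lspci.
find /sys/devices -iname "$DRIVE" 2>/dev/null
{ command -v lspci >/dev/null 2>&1 && lspci -vvs "$BUS" | grep -i slot; } || true
```

On a real server, run these steps one at a time and confirm each result before continuing, as the numbered procedure describes.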

Performing hot removal in VMware

1.     Identify the NVMe drive to be removed. As shown in Figure 46, click the Devices tab from the VMware ESXi management GUI.

This procedure uses drive t10.NVMe__INTEL_SSDPE2KE016T8_______BTLN813609NS1P6AGN_00000001 as an example.

Figure 46 Identifying the NVMe drive to be removed

 

2.     Stop services on the NVMe drive to be removed.

3.     Back up the drive data.

4.     Click the drive name to view its mounting status:

¡     If partitions exist, go to step 5 to unmount the drive.

¡     If no partition exists, turn on the LEDs on the drive. For more information, see step 6.

Figure 47 Viewing the mounting status

 

5.     (Optional.) Unmount the NVMe drive.

a.     Click the Datastores tab to view the mounted NVMe drives.

Figure 48 Viewing the mounted NVMe drives

 

b.     Click the drive and verify its name. Make sure it is the drive you intend to remove.

Figure 49 Verifying the drive name

 

c.     Click Actions and then select Unmount from the list that opens. In the dialog box that opens, click Yes.

Figure 50 Unmounting the NVMe drive

 

Figure 51 Confirming the drive removal

 

d.     Click the Datastores tab to view the unmount status. As shown in Figure 52, the drive capacity is 0 B, indicating that the NVMe drive has been unmounted successfully.

Figure 52 Viewing the drive removal status

 

6.     Turn on the LEDs on the NVMe drive to identify the location of the NVMe drive on the server.

The LED lighting tool varies by VROC driver version. For version 2.0 or earlier, the tool name is intel-vmd-user. For versions later than 2.0, the tool name is intel-vmdr-user. You can access the H3C official website to obtain the latest VROC driver compatible with the server. This example uses intel-vmd-user.

a.     Execute the esxcfg-mpath -L command to view the SCSI ID for the NVMe drive. As shown in Figure 53, the VMD adapter for the drive is vmhba2 and the drive number is T1.

Figure 53 Viewing the SCSI ID for the NVMe drive

 

b.     Execute the cd /opt/intel/bin/ command to access the directory where the intel-vmd-user tool resides.

Figure 54 Accessing the directory where the tool resides

 

c.     Execute the ./intel-vmd-user set-led vmhba2 -d 1 -l identify command to turn on the LEDs on the drive. The value 1 represents T1, the drive number.

Figure 55 Turning on LEDs on the drive

 

d.     Observe the LEDs on the NVMe drive. You can remove the NVMe drive after the Fault/UID LED turns steady blue and the Present/Active LED turns steady green.

7.     Remove the NVMe drive. For more information, see "Remove the NVMe drive."
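The ESXi locate-LED steps above can be sketched as follows. The adapter name vmhba2 and drive number 1 (T1) are the examples from this guide, and the intel-vmd-user tool name applies to VROC driver 2.0 or earlier; the commands are guarded so the sketch is a no-op on a host that is not running ESXi.

```shell
# Hedged sketch of lighting the locate LED on an NVMe drive under ESXi.
# vmhba2 and drive number 1 are this guide's example values.
ADAPTER=vmhba2
DRIVE_NUM=1

# List paths to find the VMD adapter and drive number for the NVMe drive.
{ command -v esxcfg-mpath >/dev/null 2>&1 && esxcfg-mpath -L; } || true

# Light the locate LED on that drive from the tool's directory.
if [ -x /opt/intel/bin/intel-vmd-user ]; then
    cd /opt/intel/bin/
    ./intel-vmd-user set-led "$ADAPTER" -d "$DRIVE_NUM" -l identify
fi
echo "LED commands issued for ${ADAPTER} drive ${DRIVE_NUM}"
```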

Performing hot insertion

The installation method is similar on different servers. This example uses an R4300 G5 server.

Restrictions and guidelines

To avoid system errors, install only one drive at a time. To install multiple drives, insert them one after another, and install one drive only after you completely install the previous one.

Do not swap drives frequently. If you swap a drive repeatedly within 30 seconds, the system might fail to identify the drive.

If the replaced drive is in a RAID setup, make sure the capacity of the new drive is equal to or larger than the minimum capacity of a member drive in the RAID setup.

Performing hot insertion in Windows

1.     Install an NVMe drive.

a.     Install the drive onto the carrier. Install the four screws into the screw holes and then fasten the screws in sequence.

Figure 56 Installing the drive onto the carrier

 

b.     Press the button on the drive panel to release the locking lever.

Figure 57 Releasing the locking lever

 

c.     Insert the drive into the slot until you cannot push it any further.

 

IMPORTANT:

To avoid freezing or restarting the operating system, insert the drive at a constant speed and do not pause during the process.

 

d.     Close the locking lever until you hear a click.

Figure 58 Installing a drive

 

2.     Observe the LEDs on the NVMe drive. The NVMe drive is present in the slot without any faults if the Present/Active LED is steady green and the Fault/UID LED is off.

3.     Run Intel® Virtual RAID on CPU to view the operating status of the NVMe drive, as shown in Figure 59. Verify that the drive properties are consistent with the actual drive specifications.

Figure 59 Verifying the status of the installed NVMe drive in Windows

 

Performing hot insertion in Linux

1.     Install an NVMe drive. For more information, see "Performing hot insertion in Windows."

2.     Observe the LEDs on the NVMe drive. The NVMe drive is present in the slot without any faults if the Present/Active LED is steady green and the Fault/UID LED is off.

3.     Verify that the drive has been identified by the system. Execute the lspci -vvs bus_number command and verify that you can view information about the NVMe drive.

To obtain the bus number, see "Performing hot removal in Linux."

Figure 60 Verifying that the drive has been identified by the system

 

4.     Verify that the drive has been installed correctly. Execute the lsblk command and verify that you can view the drive letter of the drive.

Figure 61 Viewing the drive letter of the drive

 
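The two verification steps above can be sketched together. The bus number is the example value carried over from the removal procedure and will differ on a real server; the tools are guarded so the sketch is safe to run anywhere.

```shell
# Hedged sketch of verifying a hot-inserted NVMe drive in Linux.
# The bus number is this guide's example value.
BUS=10000:04:00.0

# The re-inserted drive should appear as a PCIe function again ...
{ command -v lspci >/dev/null 2>&1 && lspci -vvs "$BUS"; } || true

# ... and expose a block device with a drive letter.
{ command -v lsblk >/dev/null 2>&1 && lsblk | grep nvme; } || true

echo "verification commands issued for bus ${BUS}"
```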

Performing hot insertion in VMware

1.     Install an NVMe drive. For more information, see "Performing hot insertion in Windows."

2.     Observe the LEDs on the NVMe drive. The NVMe drive is present in the slot without any faults when the Present/Active LED is steady green and the Fault/UID LED is off.

3.     Verify that the drive has been installed successfully. Execute the esxcfg-mpath -L command and verify that you can find the drive in the command output.

Figure 62 Verifying that the drive has been installed successfully

 


Managed hot swapping of an NVMe drive

This chapter describes managed hot removal and hot insertion procedures.

Managed hot removal flowchart

Figure 63 Managed hot removal flowchart

 

Performing managed hot removal

Performing managed hot removal in Windows

Prerequisites

For more information, see "Prerequisites."

Procedure

1.     Stop services on the NVMe drive.

2.     Back up the drive data.

3.     Run Intel® Virtual RAID on CPU to view NVMe drives. For more information, see step 3 in "Performing hot removal in Windows."

4.     Identify the NVMe drive location. For more information, see step 4 in "Performing hot removal in Windows."

5.     (Optional.) If the NVMe drive is in a RAID setup configured with hot spares, view the RAID rebuild status. For more information, see step 5 in "Performing hot removal in Windows."

6.     Click Activate LED to turn on the LEDs on the drive, as shown by callout 1 in Figure 64. The Fault/UID LED on the physical drive will turn steady blue for 10 seconds and turn off automatically. The Present/Active LED will turn steady green.

7.     Click Remove Disk, as shown by callout 2 in Figure 64.

Figure 64 Removing the NVMe drive

 

8.     Observe the drive status. Make sure the Fault/UID LED is steady blue, the Present/Active LED is steady green, and the NVMe drive has been removed from the Devices list of Intel® Virtual RAID on CPU.

9.     Remove the NVMe drive. For more information, see step 7 in "Performing hot removal in Windows."

Performing managed hot removal in Linux

1.     Identify the drive letter of the NVMe drive to be removed. For more information, see step 1 in "Performing hot removal in Linux."

2.     Stop services on the NVMe drive.

3.     Perform one of the following tasks based on the status of the NVMe drive to be removed:

¡     For a passthrough drive, view the mounting status of the drive. If the drive has been mounted, first unmount it. For more information, see step 5 in "Performing hot removal in Linux."

¡     For a RAID member configured with hot spares, view the RAID rebuild status. For more information, see step 6 in "Performing hot removal in Linux."

¡     For a RAID member not configured with hot spares, unmount the drive first. For more information, see step 7 in "Performing hot removal in Linux."

4.     (Optional.) If the RHEL7.6, RHEL8.0, SUSE15, SUSE15SP1, or SUSE12SP4 operating system is used, create the ledmon.service. This step is required for drive location because the Fault/UID LED cannot be lit by the removal command in these operating systems.

a.     Execute the vim /usr/lib/systemd/system/ledmon.service command to create a ledmon.service file.

Figure 65 Creating a ledmon.service file

 

b.     Edit the file.

Figure 66 Editing the file

 

c.     Start the ledmon.service.

By default, the ledmon.service is disabled in RHEL8.2. You must enable the service manually.

Figure 67 Starting the ledmon.service

 
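The contents of the ledmon.service file are shown only as a figure above. As a reference, a typical unit file shipped with the ledmon package looks roughly like the following; the exact paths and options may vary by distribution, so treat this as a sketch and compare it against your distribution's package.

```ini
# Sketch of a typical ledmon systemd unit file; verify the ExecStart path
# (here assumed to be /usr/sbin/ledmon) against your distribution.
[Unit]
Description=Enclosure LED Utilities
Documentation=man:ledmon(8)

[Service]
ExecStart=/usr/sbin/ledmon --foreground
Type=simple

[Install]
WantedBy=multi-user.target
```

After saving the file, running systemctl daemon-reload followed by systemctl enable --now ledmon.service starts the daemon, matching the "Start the ledmon.service" step above.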

5.     Remove the NVMe drive from the operating system and verify the removal:

a.     Execute the echo 1 > /sys/block/nvme2n1/device/device/remove command to remove the drive. In this example, drive nvme2n1 is removed.

Figure 68 Removing the NVMe drive

 

b.     Execute the lsblk command and verify that the drive is not displayed in the command output.

Figure 69 Verifying the removal

 

6.     Observe the LEDs on the NVMe drive. You can remove the NVMe drive after the Fault/UID LED turns steady amber and the Present/Active LED turns steady green.

7.     Remove the NVMe drive. For more information, see step 7 in "Performing hot removal in Windows."
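The managed hot-removal command in step 5 above can be sketched as follows. The drive name nvme2n1 is this guide's example and the sysfs path follows its output; the write is guarded so the sketch is safe on a machine without the drive (on a real server it requires root).

```shell
# Hedged sketch of detaching an NVMe drive from the OS before physical removal.
# nvme2n1 is this guide's example drive name.
DRIVE=nvme2n1
NODE="/sys/block/${DRIVE}/device/device/remove"

# Ask the kernel to detach the device before it is physically pulled.
if [ -w "$NODE" ]; then
    echo 1 > "$NODE"
fi

# After removal, the drive letter should no longer be listed.
if ! lsblk 2>/dev/null | grep -q "$DRIVE"; then
    echo "drive ${DRIVE} no longer visible"
fi
```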

Performing hot insertion

For more information, see "Performing hot insertion."


Appendix  RAID issues for NVMe drive replacement

Operating mode of the new drive

·     If the removed NVMe drive is a passthrough drive, the new NVMe drive also functions as a passthrough drive.

·     The new drive functions as a passthrough drive in the following situations even if RAID is configured for the removed drive:

¡     The removed NVMe drive is a member drive in a RAID setup that does not offer redundancy.

¡     The removed NVMe drive is a member drive in a RAID setup that offers redundancy, but the setup has no hot spares and RAID rebuild is disabled.

¡     The removed NVMe drive is a member drive in a RAID setup that offers redundancy and is configured with hot spares.

You can configure RAID as needed. For more information, see the storage controller user guide.

RAID rebuild

If you removed an NVMe drive in a redundant RAID setup configured with no hot spares but enabled with RAID rebuild, the storage controller automatically rebuilds the RAID after the replacement.

Procedures for viewing the RAID rebuild status vary by operating system.

Support for RAID rebuild and the default feature state depend on the storage controller. For more information, see the storage controller user guide.

Make sure the capacity of the new drive is equal to or larger than the minimum capacity of a member drive in the RAID setup.

Viewing the RAID rebuild status in Windows

Run Intel® Virtual RAID on CPU to view the RAID rebuild status.

Figure 70 RAID rebuild completed

 

Figure 71 RAID rebuild in progress

 

Viewing the RAID rebuild status in Linux

For a RAID setup configured with hot spares and enabled with RAID rebuild

1.     Remove the old drive after the RAID rebuild is complete. For more information, see step 4 in "Performing hot removal in Linux."

2.     Install the new drive. If the Present/Active LED is green and the Fault/UID LED is steady amber, the new NVMe drive functions as a passthrough drive.

3.     Execute the ledctl off=/dev/<name of the new drive> command to turn off the Fault/UID LED, as shown in Figure 72.

Figure 72 Turning off the Fault/UID LED

 

4.     (Optional.) Execute the mdadm -a /dev/<container name> /dev/<name of the new drive> command to add the new drive into the old RAID container. The new drive then functions as a hot spare.
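Steps 3 and 4 above can be sketched together. The drive and container names are this guide's examples, and the commands are guarded so they are skipped where the tools or devices are absent.

```shell
# Hedged sketch of clearing the locate LED on the new drive and optionally
# returning it to the container as a hot spare; names are this guide's examples.
NEW=/dev/nvme2n1
CONTAINER=/dev/md127

# Turn off the Fault/UID LED lit on the new passthrough drive.
{ command -v ledctl >/dev/null 2>&1 && ledctl "off=${NEW}"; } || true

# Optionally return the drive to the old container as a hot spare.
{ command -v mdadm >/dev/null 2>&1 && mdadm -a "$CONTAINER" "$NEW" 2>/dev/null; } || true
```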

For a RAID setup configured with no hot spares and enabled with RAID rebuild

1.     After you install the new drive, the Present/Active LED turns green and the Fault/UID LED turns steady amber. Before the RAID rebuild can start, add the new drive into the container by executing the mdadm -a /dev/<container name> /dev/<name of the new drive> command. The storage controller then automatically rebuilds the RAID.

2.     Execute the cat /proc/mdstat command to view the RAID rebuild status.

Figure 73 RAID rebuild completed

 

Figure 74 RAID rebuild in progress

 

3.     After the RAID rebuild is complete, verify that the Fault/UID LED turns off and the Present/Active LED turns steady green.
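The rebuild check in step 2 can also be scripted as a wait loop, sketched below; it polls /proc/mdstat for a "recovery" line and exits immediately when no rebuild is running.

```shell
# Hedged sketch of waiting for an mdraid rebuild to finish before checking
# the LEDs; exits at once if no rebuild is in progress.
while grep -q recovery /proc/mdstat 2>/dev/null; do
    sleep 10   # check again every 10 seconds
done
RESULT="no rebuild in progress"
echo "$RESULT"
```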