H3C Lossless Network Best Practices-6W100


H3C Lossless Network Best Practices

S6850 Switch Series

 

S9850 Switch Series

 

S9820-8C

Copyright © 2022 New H3C Technologies Co., Ltd. All rights reserved.

No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of New H3C Technologies Co., Ltd.

The information in this document is subject to change without notice.


 



Overview

Technical background

High-performance computing (HPC), big data analysis, artificial intelligence (AI), and the Internet of Things (IoT) are developing fast, and centralized/distributed storage and cloud databases are widely used. As a result, service applications need to obtain more and more data from networks, which places higher requirements on the switching speed and performance of data center networks.

In traditional TCP/IP software and hardware architectures and applications, network transmission and data processing delays are long, data is copied and interrupted multiple times, and TCP/IP protocol processing is complicated. Remote Direct Memory Access (RDMA) reduces the data processing delay on servers during network transmission. RDMA transmits user application data directly to the storage space of the servers, and uses the network to quickly move data from the local system to the storage of the remote system. RDMA eliminates the multiple data copy and context switching operations during transmission and reduces the CPU load.

Figure 1 Traditional TCP/IP data transmission process

 

Figure 2 RDMA data transmission process

 

However, the InfiniBand network architecture dedicated to RDMA is closed and requires dedicated IB switches and IB NICs. It is incompatible with existing Ethernet networks and is costly to deploy. The RDMA over Converged Ethernet (RoCE) technology was introduced to effectively resolve these issues. RoCE carries IB over Ethernet to implement RDMA over Ethernet.

RoCE has the following versions:

·     RoCEv1—Carries RDMA over Ethernet. RoCEv1 can be deployed only on Layer 2 networks.

·     RoCEv2—Carries RDMA over UDP/IP protocols. RoCEv2 can be deployed on Layer 3 networks.

Currently, all RDMA applications use the RoCEv2 protocol. However, because RDMA was initially designed for a lossless InfiniBand network, RoCEv2 does not provide a perfect packet loss protection mechanism. The loss of any packet causes a large number of retransmissions and seriously affects data transmission performance. At the same time, applications related to HPC and distributed storage use the N:1 incast traffic model, which easily causes transient congestion or even packet loss in the queue buffers of Ethernet switches. As a result, application latency increases, throughput decreases, and distributed application performance drops. To support RoCEv2 and related applications, you must first deploy a lossless network on the Ethernet switches.

Building a lossless Ethernet network to support RoCE

On an RoCE network, you must build a lossless Ethernet network to ensure zero packet loss. Lossless Ethernet must support the following key features:

·     Data buffer management and monitoring—Adjusts the buffer space that each interface or queue can use according to traffic characteristics, and reports buffer usage conditions at the CLI or through gRPC.

Adjust the data buffer size under the guidance of professionals. As a best practice, configure data buffer monitoring.

·     (Required.) PFC—Priority-based Flow Control. PFC provides per-hop priority-based flow control for multiple types of traffic separately.

·     (Required.) ECN—Explicit Congestion Notification. When the device is congested, ECN marks the ECN field in the IP header of a packet. The receiver sends congestion notification packets (CNPs) to notify the sender to slow down the sending speed. ECN implements end-to-end congestion management and reduces the spread and deterioration of congestion.

·     (Recommended.) DCBX—Data Center Bridging Capability Exchange Protocol. DCBX uses LLDP to autonegotiate DCB capabilities, including PFC and ETS capabilities. Typically, DCBX is used on the interface connecting the switch to the server, and negotiates capabilities with the server NIC.

·     (Optional.) ETS—Enhanced Transmission Selection. ETS classifies traffic by service type, provides minimum guaranteed bandwidth for different traffic types, and improves the link utilization. ETS must be configured hop by hop.

Figure 3 Key features for building a lossless Ethernet

 

On an RoCE network, PFC must be used together with ECN to guarantee both zero packet loss and bandwidth. Table 1 compares PFC and ECN.

Table 1 PFC vs ECN

Item                          | PFC                                                                                              | ECN
Network location              | Layer 2                                                                                          | Network layer and transport layer
Effective scope               | Point-to-point                                                                                   | End-to-end
Need for network-wide support | Yes                                                                                              | No
Controlled objects            | Previous node on the network (if the server NIC supports PFC, PFC also takes effect on the NIC) | Sender host
Packet buffer location        | Intermediate nodes and sender                                                                    | Sender
Affected traffic              | All traffic in one of the eight queues on the device                                             | Congested connection
Response speed                | Fast                                                                                             | Slow

 

This document describes the key technologies for building a lossless network, including data buffers, PFC, and ECN. For how to configure these technologies and other lossless network-related technologies, see the configuration guides and command references for your devices.

Data buffers

Data buffers temporarily store packets to avoid packet loss.

The following data buffers are available:

·     Ingress buffer—Stores incoming packets when the CPU is busy.

·     Egress buffer—Stores outgoing packets when network congestion occurs.

·     Headroom buffer—Stores packets when the ingress buffer or egress buffer is used up.

Figure 4 shows the structure of ingress and egress buffers.

Figure 4 Data buffer structure

 

By default, all queues have an equal share of the shared area and fixed area. You can change the shared-area or fixed-area size for a queue. The unconfigured queues use the default setting.
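For example, the following commands are a minimal sketch of adjusting the shared-area ratio for a single egress queue and applying the new setting. The queue number and ratio are illustrative only, and command syntax and defaults might vary by device and software version; as noted above, adjust buffer sizes only under the guidance of professionals.

<Sysname> system-view
# Set the maximum shared-area ratio of egress queue 5 to 20% (illustrative value).
[Sysname] buffer egress cell queue 5 shared ratio 20
# Apply the manually configured buffer settings.
[Sysname] buffer apply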

PFC

About PFC

PFC is required for building a lossless Ethernet network. PFC provides per-hop priority-based flow control. When the device forwards packets, it assigns packets to queues for scheduling and forwarding by looking up packet priorities in priority mapping tables. When the sending rate of packets carrying an 802.1p priority exceeds the receiving rate and the data buffer space on the receiver is insufficient, the receiver sends PFC pause frames to the sender. When the sender receives the PFC pause frames, the sender stops sending packets with the specified 802.1p priority until it receives PFC XON frames or the aging timer expires. When PFC is configured, congestion of one type of packets does not affect the normal forwarding of the other types of packets, and different types of packets on a link are forwarded independently.

Mechanism

PFC pause frame generation mechanism

Figure 5 How PFC pause frames are generated

 

As shown in Figure 5, PFC pause frames are generated in the following process:

1.     When Port 1 of Device B receives packets from Device A, the memory management unit (MMU) of Device B allocates cell resources to the packets. If PFC is enabled on Device B, Device B counts the cell resources occupied by packets with each 802.1p priority.

 

 

NOTE:

Cell resources are used to store packets. An interface allocates cell resources to packets based on packet sizes. Suppose a cell resource provides 208 bytes. An interface allocates one cell resource to a 128-byte packet and two cell resources to a 300-byte packet.

 

2.     When the cell resources used by packets carrying a certain 802.1p priority exceed the set threshold on Port 1 of Device B, Port 1 of Device B sends PFC pause frames for the 802.1p priority to Device A.

3.     When Device A receives the PFC pause frames for the 802.1p priority, Device A stops sending out packets carrying the 802.1p priority and buffers these packets. If the buffer threshold for the 802.1p priority is reached, Device A sends PFC pause frames for the 802.1p priority to its upstream device, as shown in Figure 6.

Figure 6 PFC pause frame processing between multiple devices

 

Packet priority-to-queue mappings

When a device forwards packets, packets with different priority values are assigned to different queues for scheduling and forwarding. The packet priority-to-queue mappings depend on the priority mapping method configured. The device supports the following priority mapping methods:

·     Configuring priority trust mode—In this method, you can configure an interface to trust the specified type of priority carried in packets. Then, the device looks up the trusted priority type in incoming packets in the priority maps and modifies the priority values in packets based on the priority maps. Packets are scheduled within the device based on the priorities. Available priority trust modes include:

¡     dot1p—Trusts the 802.1p priority carried in packets and uses the 802.1p priority for priority mapping.

¡     dscp—Trusts the DSCP priority carried in IP packets and uses the DSCP priority for priority mapping.

·     Changing port priority—If no priority trust mode is configured for an incoming interface, the port priority of the incoming interface is used for priority mapping. By changing the port priority of an interface, you can change the priority of incoming packets on the interface. Then, packets received on different incoming interfaces can be assigned to the corresponding queues and receive differentiated scheduling.

When configuring PFC on an interface, you must configure the interface to trust the 802.1p or DSCP priority carried in packets. When the interface receives Ethernet packets, the interface marks local precedence values for packets according to the priority trust mode and the 802.1Q tagging status of packets. Then, the packets are scheduled based on their local precedence values. Figure 7 shows the detailed process.

 

 

NOTE:

This document describes only the packet priority to local precedence mappings when the interface trusts the 802.1p or DSCP priority carried in packets. For information about the port priority configuration and the drop precedence (used as a reference for dropping packets), see the configuration guide for your device.

 

Figure 7 Packet priority to queue mappings

 

PFC restrictions and guidelines

For PFC to work properly on an interface, perform the following tasks:

·     Configure the interface to trust the 802.1p or DSCP priorities carried in packets by using the qos trust { dot1p | dscp } command.

·     Make sure the 802.1p-to-local priority map and DSCP-to-802.1p priority map are the same on all interfaces along the transmission path.

You can configure PFC in both system view and Ethernet interface view. If you configure PFC in system view and Ethernet interface view multiple times, the most recent configuration takes effect.

For IRF and other protocols to operate correctly, as a best practice, do not enable PFC for 802.1p priority 0, 6, or 7.

To perform PFC on an IRF fabric, configure PFC on the IRF physical interfaces that are bound to the IRF port. For information about IRF, see Virtual Technologies Configuration Guide.

To perform PFC in an overlay network, execute the qos trust tunnel-dot1p command. For information about the overlay network, see VXLAN Configuration Guide. For information about the qos trust tunnel-dot1p command, see ACL and QoS Command Reference.

To avoid packet loss, apply the same PFC configuration to all interfaces that the packets pass through.
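The following minimal sketch combines these guidelines on a single interface, assuming RoCE traffic carries 802.1p priority 5 (as in the configuration example later in this document). The interface name is illustrative; verify the exact command syntax against the configuration guide for your device.

<Sysname> system-view
[Sysname] interface hundredgige 1/0/25
# Trust the 802.1p priority carried in packets.
[Sysname-HundredGigE1/0/25] qos trust dot1p
# Enable PFC and apply it to 802.1p priority 5.
[Sysname-HundredGigE1/0/25] priority-flow-control enable
[Sysname-HundredGigE1/0/25] priority-flow-control no-drop dot1p 5
[Sysname-HundredGigE1/0/25] quit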

Setting PFC thresholds

About PFC thresholds

By configuring the PFC buffer thresholds, you can avoid tail drop in the sending data buffer when the buffer space is insufficient for incoming packets.

Figure 8 PFC buffer thresholds

 

The device supports the following PFC thresholds:

·     Headroom buffer threshold—Maximum number of cell resources that can be used by packets with a specific 802.1p priority value in a headroom storage space, as indicated by headroom in Figure 8. An interface drops received packets once this threshold is reached.

·     Back pressure frame triggering threshold—Maximum number of cell resources that can be used by packets with a specific 802.1p priority value in a shared storage space, as indicated by XOFF in Figure 8. PFC is triggered to send PFC pause frames when this threshold is reached. The back pressure frame triggering threshold includes the following types:

¡     Dynamic back pressure frame triggering threshold—Maximum cell resources set in percentage.

¡     Static back pressure frame triggering threshold—Maximum cell resources set in an absolute value.

The XoffThres and IsDynamic fields in the display buffer usage interface verbose command output represent the back pressure frame triggering threshold and its type.

·     Offset between the back pressure frame stopping threshold and triggering threshold—As indicated by ingress-threshold-offset in Figure 8. When the number of cell resources used by packets with a specific 802.1p priority value decreases by this offset after PFC is triggered, PFC will stop sending PFC pause frames.

·     PFC reserved threshold—As indicated by reserved-buffer in Figure 8. Number of cell resources reserved for packets with a specific 802.1p priority value in a guaranteed storage space.

·     Maximum number of cell resources in a headroom storage space—Number of cell resources allocated to the headroom storage space in a storage pool (only pool 0 is supported in the current software version).

PFC threshold restrictions and guidelines

After PFC is enabled for 802.1p priorities, the PFC thresholds use the default values, which are adequate in typical network environments. As a best practice, change the PFC thresholds only when necessary. If the network environment or traffic is complicated, consult professionals to adjust the PFC thresholds.

Headroom buffer threshold

The recommended headroom buffer threshold for an interface depends on the transmission speed and transmission distance of the interface.

·     For a 100-GE interface:

¡     When the transmission distance is 300 meters, you only need to execute the priority-flow-control no-drop dot1p command on the interface. This command automatically sets the default headroom buffer threshold, which is adequate for implementing zero packet loss for the transmission distance.

¡     When the transmission distance is 10000 meters, you must use the priority-flow-control dot1p headroom command on the interface to set the headroom buffer threshold. As a best practice, set the headroom buffer threshold to 9984. You can adjust the threshold as needed to ensure zero packet loss for the transmission distance.

¡     When the transmission distance is longer than 10000 meters, the current software version of this device series does not support transceiver modules for such transmission distances.

·     For a non-100-GE interface:

¡     When the transmission distance is 300, 10000, or 20000 meters, you only need to execute the priority-flow-control no-drop dot1p command on the interface. This command automatically sets the default headroom buffer threshold, which is adequate for implementing zero packet loss for the transmission distance.

¡     When the transmission distance is 40000 meters, you must use the priority-flow-control dot1p headroom command on the interface to set the headroom buffer threshold. As a best practice, set the headroom buffer threshold to 9984. You can adjust the threshold as needed to ensure zero packet loss for the transmission distance.
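For example, the following sketch shows how the headroom buffer threshold might be set for a long-distance 100-GE link, using the recommended value of 9984 mentioned above. The interface name and 802.1p priority 5 are illustrative; verify the syntax of the priority-flow-control dot1p headroom command for your device.

<Sysname> system-view
[Sysname] interface hundredgige 1/0/25
# Enable PFC for 802.1p priority 5.
[Sysname-HundredGigE1/0/25] priority-flow-control no-drop dot1p 5
# Set the headroom buffer threshold to 9984 cell resources for 802.1p priority 5.
[Sysname-HundredGigE1/0/25] priority-flow-control dot1p 5 headroom 9984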

To set a more exact headroom buffer threshold, calculate the threshold as follows:

1.     Calculate the size of in-transit traffic (traffic transmitted between the time the receiver sends PFC pause frames and the time the sender receives PFC pause frames and actually stops sending packets). The formula is in-transit traffic (bytes) = MTUR + MTUs + Response + 2*link_delay, where the parameters are defined as follows:

¡     MTUR—Length of a large packet if the chip needs to send a large packet between the time the back pressure frame trigger threshold is triggered and the time the device sends PFC pause frames. Because the large packet might come from any queue, the packet length must be the maximum length of frames allowed to pass through, 9216 bytes.

¡     MTUs—Length of the packet that the upstream device sends immediately before stopping sending packets. Because only packets with the specified 802.1p priority affect the threshold on the receiver, the MTUs is the length of a packet with the 802.1p priority specified by PFC.

¡     Response—Length of data generated within the response time between the time the upstream device receives PFC pause frames and the time the device stops sending packets. The value is fixed at 3840 bytes.

¡     2*link_delay—Length of the forward and backward buffered packets on the link, as shown in the following table:

 

Interface speed | Buffered packets (bytes) per 100 meters of cable
10 Gbps         | 1300
25 Gbps         | 3250
40 Gbps         | 5200
100 Gbps        | 13000

 

In conclusion, in-transit traffic (bytes) = 9216 + MTUs + 3840 + N × buffered packets (bytes) per 100 meters of cable, where N is the cable length in units of 100 meters.

2.     Calculate the number of cell resources needed by the in-transit traffic. Also consider the influence of packet length on cell resource allocation: in the extreme case where every packet is 64 bytes, each packet occupies a full cell resource, so the most cell resources are consumed.

¡     On the S6850 switch series and S9850 switch series, each cell resource provides 256 bytes.

¡     On the S9820-8C switch series, each cell resource provides 254 bytes.
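For illustration only, consider a 100-GE interface on an S6850 switch with a 10000-meter link and an assumed MTUs of 1500 bytes: in-transit traffic = 9216 + 1500 + 3840 + 100 × 13000 = 1,314,556 bytes. At 256 bytes per cell resource, this is roughly 1,314,556 / 256 ≈ 5135 cell resources for full-size packets. The recommended headroom value of 9984 for this distance leaves extra margin for small-packet traffic, where each packet occupies a full cell resource.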

Maximum number of available cell resources in the headroom

The headroom resources are shared and preempted by all ports and all queues on the device. Consider the headroom space needed by each port and the number of ports, and then determine the smallest value that meets the requirements.

Back pressure frame triggering threshold

When determining the back pressure frame triggering threshold, analyze the network traffic model and make sure the sum of the back pressure frame triggering thresholds on all incoming interfaces does not exceed the tail drop threshold of the outgoing interface. The tail drop threshold of the outgoing interface is configured by using the buffer egress cell queue shared ratio command and is 20% by default. In a lossless Ethernet network, the value can be set to 100%. Calculate the proportion of traffic on each incoming interface to the total buffer, and then obtain the dynamic back pressure frame triggering threshold (in percentage) from Table 2. Finally, adjust the threshold based on actual traffic conditions to obtain the optimal value, and configure that value on the incoming interface.

Table 2 Dynamic back pressure frame triggering thresholds

Percentage (dynamic back pressure frame triggering threshold) | Alpha value (ratio of cells in a queue to available cells, configured by the chip register) | Proportion to total buffer (%)
0         | 1/128 | 0.77
1         | 1/64  | 1.53
2 to 3    | 1/32  | 3.03
4 to 5    | 1/16  | 5.88
6 to 11   | 1/8   | 11.11
12 to 20  | 1/4   | 20.00
21 to 33  | 1/2   | 33.33
34 to 50  | 1     | 50.00
51 to 66  | 2     | 66.66
67 to 80  | 4     | 80.00
81 to 100 | 8     | 88.88

 

Additionally, when PFC is used together with ECN, for ECN to take effect first, make sure the back pressure frame triggering threshold (the number of cell resources) is greater than the high-limit value configured in the queue queue-id [ drop-level drop-level ] low-limit low-limit high-limit high-limit [ discard-probability discard-prob ] command.

Calculate the number of cell resources for the back pressure frame triggering threshold by using the formula: Used=Total*Alpha/(1+N*Alpha), where the parameters are defined as follows:

·     Used—Buffer size used by a single flow when multiple flows are congested.

·     Total—Total buffer of the device, which is 128K cell resources for the S6850 and S9850 switch series and 258K cell resources for the S9820-8C switch series.

·     Alpha—Ratio of cells in a queue to available cells.

·     N—Number of congested flows that can be processed by the network.
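For illustration only, suppose an S6850 switch (Total = 131072 cell resources), Alpha = 1/8 (corresponding to a dynamic threshold of 6 to 11 percent in Table 2), and N = 16 congested flows. Then Used = 131072 × (1/8) / (1 + 16 × (1/8)) = 16384 / 3 ≈ 5461 cell resources per flow. These values are assumptions used to show the arithmetic, not recommendations.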

Offset between the back pressure frame stopping threshold and triggering threshold

Configure the offset as a value slightly greater than the maximum length of a packet with the specified priority. Too large an offset will slow down the back pressure frame stopping speed and affect the traffic sending speed.

PFC reserved threshold

As a best practice, configure the PFC reserved threshold as a value slightly greater than the maximum length of a packet with the specified priority. You can calculate the PFC reserved threshold value as follows: (maximum length of a packet carrying the specified priority + packet header length (64 bytes) + bytes of one cell resource)/(bytes of one cell resource).
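For illustration only, if the maximum length of a packet carrying the specified priority is assumed to be 1500 bytes and each cell resource provides 256 bytes (S6850 and S9850 switch series), the formula gives (1500 + 64 + 256) / 256 ≈ 7.1. A reserved threshold of about 8 cell resources would therefore be slightly greater than the maximum packet length, as recommended.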

Configuring PFC deadlock detection

About PFC deadlock detection

Typically, when an interface is congested and the PFC back pressure frame triggering threshold is reached, the device sends PFC pause frames to the upstream device. Upon receiving the PFC pause frames, the upstream device stops sending packets. If the back pressure frame triggering threshold is reached on the upstream device, the upstream device sends PFC pause frames to its upstream device, and so on. This process is repeated until the server that sends packets receives the PFC pause frames. Then, the server stops sending packets within the PFC pause time specified in the PFC pause frames. This process reduces the packet sending rate at the source end and eliminates packet loss caused by congestion on network nodes.

In some special conditions, for example, when a link or device fails, a temporary loop might appear during the route convergence period. In this case, devices repeatedly send and receive PFC pause frames, and packets cannot be forwarded. As a result, the cell resources in the interface buffer cannot be released, and the device enters the PFC deadlock state.

As shown in Figure 9, the original path for traffic from Server 1 to Server 2 is Server 1—>Device C—>Device A—>Device D—>Server 2. Suppose the link between Device D and Server 2 fails. During the route convergence period, a forwarding path Server 1—>Device C—>Device A—>Device D—>Device B—>Device C—>Server 2 appears. Therefore, a loop Device C—>Device A—>Device D—>Device B—>Device C exists. If the interface connecting Device C to Server 2 is congested, Device C sends PFC pause frames to the upstream device. Then, these PFC pause frames will be forwarded in the loop mentioned above, and each device along the loop stops sending packets and waits for the peer to release cell resources. In this case, the devices along the loop enter the PFC deadlock state.

Figure 9 How a PFC deadlock is generated

 

To remove the PFC deadlock on these devices, disable PFC or ignore the received PFC pause frames (which notify the devices to stop sending packets) so that the devices resume sending packets and forward or drop the packets in the buffer. The PFC deadlock detection interval can be accurate to the millisecond level, which effectively reduces the influence of PFC deadlocks.

Restrictions and guidelines

The specified CoS value must be within the 802.1p priority list specified by using the priority-flow-control no-drop dot1p command. To view the 802.1p priority for each CoS value, execute the display qos map-table dot1p-lp command.

PFC deadlock detection interval

The PFC deadlock detection interval for a CoS value is the product of the interval argument configured by using the priority-flow-control deadlock cos interval command and the precision configured by using the priority-flow-control deadlock precision command. For example, if you execute the priority-flow-control deadlock cos 5 interval 10 command to set the interval argument to 10 for CoS priority 5 and execute the priority-flow-control deadlock precision command to set the precision to high (which represents 10 milliseconds) for CoS priority 5, the PFC deadlock detection interval for CoS priority 5 is 10 × 10 = 100 milliseconds.

Delay timer for PFC deadlock detection automatic recovery

When PFC deadlock is detected on a device, the device ignores received PFC XOFF frames and disables PFC deadlock detection during the delay timer for PFC deadlock detection automatic recovery. Then, packets can be forwarded properly.

The delay timer is 100 milliseconds + delay-time × 100 milliseconds. For example, if you execute the priority-flow-control deadlock auto-recover cos 5 delay 10 command to set the delay-time argument to 10 for CoS priority 5, the delay timer for CoS priority 5 is 100 milliseconds + 10 × 100 milliseconds = 1100 milliseconds.
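The following minimal sketch consolidates the detection-related commands discussed above, using the same example values. It assumes these commands are executed in system view; verify the view and exact syntax for your device.

<Sysname> system-view
# Set the PFC deadlock detection precision to high (10 milliseconds).
[Sysname] priority-flow-control deadlock precision high
# Set the detection interval argument to 10 for CoS priority 5 (detection interval = 10 × 10 = 100 milliseconds).
[Sysname] priority-flow-control deadlock cos 5 interval 10
# Set the automatic recovery delay argument to 10 for CoS priority 5 (delay timer = 100 + 10 × 100 = 1100 milliseconds).
[Sysname] priority-flow-control deadlock auto-recover cos 5 delay 10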

PFC deadlock recovery mode

After the PFC deadlock is removed on a device, PFC deadlock detection must be recovered, and PFC is recovered at the same time. PFC deadlock detection can be recovered in automatic mode or manual mode.

Typically, use the automatic recovery mode when no serious failures occur. When a packet loop cannot be removed and the device enters the PFC deadlock state frequently, manually recover PFC deadlock detection on an interface as follows:

1.     Set the PFC deadlock detection recovery mode to manual on the interface.

2.     Troubleshoot the device. After the failures are solved, execute the priority-flow-control deadlock recover command to recover the PFC deadlock detection and PFC features on the interface.

After the upper threshold for PFC deadlock times during the specified period is configured, if the PFC deadlock times within the specified period exceed the upper threshold on an interface, the device disables PFC for the corresponding priority on the interface. To recover the PFC feature on the interface in this case, troubleshoot the device and execute the undo priority-flow-control deadlock threshold command after the failures are solved.

Early warning thresholds for PFC packets

You can configure the early warning threshold for incoming or outgoing PFC packets on an interface as needed. The early warning threshold flags a situation where the PFC packet transmission rate is still within a normal range but requires attention.

When the rate of PFC packets that an interface sends or receives reaches the early warning threshold, the system generates traps and logs to notify the user. According to the traps and logs, the user can discover some exceptions on the network, for example:

·     The NIC of the peer device fails and continuously sends PFC packets at a high speed. In this case, you can set the early warning threshold for incoming PFC packets.

·     The device fails and continuously sends PFC pause frames. In this case, you can set the early warning threshold for outgoing PFC packets.

To monitor bidirectional PFC packets, you can set the early warning thresholds for incoming packets and outgoing packets separately.

ECN

About ECN

ECN is required for building a lossless Ethernet network. ECN defines a traffic control and end-to-end congestion notification mechanism based on the IP layer and transport layer. ECN uses the DS field in the IP header to mark the congestion status along the packet transmission path. An ECN-capable endpoint can determine whether congestion occurs on the transmission path according to the packet contents. Then, the endpoint adjusts the packet sending speed to avoid deteriorating congestion.

Mechanism

ECN defines the last two bits (ECN field) in the DS field of the IP header as follows:

·     Bit 6 indicates whether the sending endpoint device supports ECN, and is called the ECN-Capable Transport (ECT) bit.

·     Bit 7 indicates whether the packet has experienced congestion along the transmission path, and is called the Congestion Experienced (CE) bit.

 

 

NOTE:

In actual applications, the following packets are considered as packets that an ECN-capable endpoint sends:

·     Packets with ECT set to 1 and CE set to 0.

·     Packets with ECT set to 0 and CE set to 1.

 

Figure 10 DS field location

 

Figure 11 ECN field location

 

After you enable ECN on a device, the device processes incoming packets as follows:

·     When the device assigns packets to a queue in the outbound direction, if the average queue size is below the lower threshold (ECN threshold), the device does not process these packets, and directly forwards them out of the outgoing interface.

·     When the device assigns packets to a queue in the outbound direction, if the average queue size is between the lower threshold and the upper threshold, the device performs the following operations:

¡     If the ECN field is 00, which indicates that the packets were sent by an ECN-incapable endpoint, the device calculates whether to drop a packet according to the drop probability.

¡     If the ECN field is 01 or 10, which indicates that the packets were sent by an ECN-capable endpoint, the device performs the following operations:

-     Sets both the ECT bit and the CE bit to 1 for some of these packets according to the drop probability and forwards these packets.

-     Forwards all incoming packets.

¡     If the ECN field is 11 in a packet, the device forwards the packet without modifying the ECN field. The combination of ECT bit 1 and CE bit 1 indicates that the packet has experienced congestion along the transmission path.

·     When the device assigns packets to a queue in the outbound direction, if the average queue size exceeds the upper threshold, the device performs the following operations:

¡     If the ECN field is 00, which indicates that the packets were sent by an ECN-incapable endpoint, the device drops the packets.

¡     If the ECN field is 01 or 10, which indicates that the packets were sent by an ECN-capable endpoint, the device performs the following operations:

-     Sets both the ECT bit and the CE bit to 1 and forwards these packets.

-     Forwards all incoming packets.

¡     If the ECN field is 11 in a packet, the device forwards the packet without modifying the ECN field. The combination of ECT bit 1 and CE bit 1 indicates that the packet has experienced congestion along the transmission path.

ECN works in the following flow:

1.     The sender sets the ECN field to 10 to notify the devices along the transmission path and the receiver that the sender supports ECN.

2.     When congestion occurs on an intermediate device, the congested device sets the ECN field to 11 for congested packets, and normally forwards the packets.

3.     When the receiver receives packets with the ECN field set to 11, the transport layer sends congestion notification packets (CNPs) to the sender.

4.     When receiving the CNPs, the sender slows down the sending rate of packets with the specified priority.

5.     After a configurable period of time or the specified number of packets are sent, the sender resumes the original sending rate.
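On the switch side, ECN marking is typically enabled per queue through a WRED table. The following minimal sketch assumes the queue low-limit/high-limit/discard-probability command referenced earlier in this document, a per-queue ecn keyword, and a qos wred apply command on the interface; the table name QT1, queue 5, the thresholds, and the interface name are illustrative, and the exact syntax and view prompts might differ on your device.

<Sysname> system-view
# Create a WRED table and set the ECN thresholds and drop probability for queue 5 (illustrative values).
[Sysname] qos wred queue table QT1
[Sysname-wred-table-QT1] queue 5 low-limit 1000 high-limit 2000 discard-probability 20
# Enable ECN marking for queue 5.
[Sysname-wred-table-QT1] queue 5 ecn
[Sysname-wred-table-QT1] quit
# Apply the WRED table to an interface.
[Sysname] interface hundredgige 1/0/25
[Sysname-HundredGigE1/0/25] qos wred apply QT1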

Figure 12 ECN working mechanism

 

 

Lossless network configuration examples

Example: Configuring a data center RDMA network

Network configuration

As shown in Figure 13:

·     The RDMA network uses the two-level spine-leaf architecture. The spine devices are S9820-8C switches, and the leaf devices are S6850-56HF devices.

·     The leaf devices act as the gateways of the servers. A Layer 3 ECMP network is built between spine devices and leaf devices. The Layer 3 ECMP network forwards data at a high speed and also supports redundancy of forwarding paths.

·     Simple Multichassis Link Aggregation (S-MLAG) is run between the member devices in the same leaf device group. Each server accesses the leaf devices through a dynamic aggregate interface with two member interfaces.

·     The server NICs use bonding mode 4 (dynamic aggregation mode).

Configure the network so that RDMA application packets can be transmitted losslessly in queue 5.

 

 

NOTE:

This example uses four spine devices and four leaf devices. In actual applications, you can deploy eight S9820-8C switches on the spine layer and 96 S6850-56HF switches on the leaf layer to build an RDMA network with the convergence ratio of 1.5:1.

 

CAUTION:

When the server NICs use bonding mode 4, dynamic aggregation must be configured on access switches. A server needs to support two uplink interfaces that both send ARP packets, so that the two access switches of the server can learn the ARP entries of the server. When any member interface in the aggregation group on a server goes down and then comes up, the server must send ARP packets to the access switches. Then, the switches can update the ARP entries and generate host routes.

When the server NICs use bonding mode 1 (active/backup mode), the requirements are as follows:

·     The access switches can use dynamic or static aggregation. When an interface on a server becomes inactive and then becomes active, the server NIC must be able to send ARP packets to the access switches. Then, the switches can update the ARP entries and generate host routes.

·     When dynamic aggregation is configured on switches, you must execute the lacp edge-port command on the aggregate interfaces on the access switches to configure the interfaces as edge aggregate interfaces.

·     As a best practice, set the ARP entry aging timer to 30 seconds. In this case, when a server NIC interface becomes inactive but does not go down, the switches can fast update ARP entries and generate host routes to reduce the packet loss time. As a best practice to avoid increasing the CPU burden, do not configure too short an ARP entry aging time.

 

Figure 13 Network diagram

NOTE: The interface numbering rules on Spine 2, Spine 3, and Spine 4 are the same as those on Spine 1. The interface numbering rules on Leaf 1-2, Leaf 2-1, and Leaf 2-2 are the same as those on Leaf 1-1. Not all interface numbers are marked, to keep the network diagram simple.

Device   | Interface | IP address
Leaf 1-1 | HGE1/0/25 | 9.179.56.1/31
Leaf 1-1 | HGE1/0/26 | 9.179.56.193/31
Leaf 1-1 | HGE1/0/27 | 9.179.57.129/31
Leaf 1-1 | HGE1/0/28 | 9.179.58.65/31
Leaf 1-1 | Vlan-int2 | 9.179.64.1/26
Leaf 1-2 | HGE1/0/25 | 9.179.56.3/31
Leaf 1-2 | HGE1/0/26 | 9.179.56.195/31
Leaf 1-2 | HGE1/0/27 | 9.179.57.131/31
Leaf 1-2 | HGE1/0/28 | 9.179.58.67/31
Leaf 1-2 | Vlan-int2 | 9.179.64.1/26
Leaf 2-1 | HGE1/0/25 | 9.179.56.5/31
Leaf 2-1 | HGE1/0/26 | 9.179.56.197/31
Leaf 2-1 | HGE1/0/27 | 9.179.57.133/31
Leaf 2-1 | HGE1/0/28 | 9.179.58.69/31
Leaf 2-1 | Vlan-int2 | 9.179.64.64/26
Leaf 2-2 | HGE1/0/25 | 9.179.56.7/31
Leaf 2-2 | HGE1/0/26 | 9.179.56.199/31
Leaf 2-2 | HGE1/0/27 | 9.179.57.135/31
Leaf 2-2 | HGE1/0/28 | 9.179.58.71/31
Leaf 2-2 | Vlan-int2 | 9.179.64.64/26
Spine 1  | HGE1/1/1  | 9.179.56.0/31
Spine 1  | HGE1/1/2  | 9.179.56.2/31
Spine 1  | HGE1/1/3  | 9.179.56.4/31
Spine 1  | HGE1/1/4  | 9.179.56.6/31
Spine 2  | HGE1/1/1  | 9.179.56.192/31
Spine 2  | HGE1/1/2  | 9.179.56.194/31
Spine 2  | HGE1/1/3  | 9.179.56.196/31
Spine 2  | HGE1/1/4  | 9.179.56.198/31
Spine 3  | HGE1/1/1  | 9.179.57.128/31
Spine 3  | HGE1/1/2  | 9.179.57.130/31
Spine 3  | HGE1/1/3  | 9.179.57.132/31
Spine 3  | HGE1/1/4  | 9.179.57.134/31
Spine 4  | HGE1/1/1  | 9.179.58.64/31
Spine 4  | HGE1/1/2  | 9.179.58.66/31
Spine 4  | HGE1/1/3  | 9.179.58.68/31
Spine 4  | HGE1/1/4  | 9.179.58.70/31

 

Analysis

Dual-uplink on server

Configure S-MLAG on the leaf devices to enable each server to access through two uplinks. (The network requirements forbid using the IRF technology.)

Configure ARP to generate 32-bit host routes on the leaf devices. When the link to a member device fails, the routes can be promptly updated to switch traffic to the other member device in the same leaf device group.

When a server sends ARP packets to request the MAC address of a server on the same subnet and connected to the same leaf device group, the leaf devices will flood the ARP packets. To suppress broadcast packets on a leaf device, perform the following tasks:

·     Configure a QoS policy on the leaf device to allow only ARP packets sent by the VLAN interface on the device's service subnet and deny other ARP packets.

·     Enable local ARP proxy for the VLAN interface of the VLAN where the service packets of the leaf device are sent. Then, the hosts on the same subnet and connected to the same leaf device group can communicate at Layer 3.

Configure monitor link on each leaf device. When all the uplink interfaces go down, the downlink interfaces connecting to the servers also go down. Then, traffic can be switched to the other member device in the same leaf device group to protect service traffic.

Route settings

Configure BGP. Configure the AS number as 805 on the spine devices. Configure different AS numbers on different member leaf devices (AS 5001, AS 5002, AS 5003... in this example), so that a leaf device can learn routes advertised by other leaf devices. When a leaf device receives routes advertised by a BGP neighbor, it will check the AS_PATH attribute in the routes. If the AS_PATH attribute of a route contains the leaf device's AS number, the leaf device will not learn the route.

Configure routes and routing policies, so that the spine devices can learn all network routes and host routes and each leaf device can learn network routes and the host routes of the other member device in the same group to save the route resources.

RDMA settings

To transmit RDMA application packets losslessly, configure PFC and ECN:

·     PFC performs flow control for packets based on their priority queues. RDMA packets carry 802.1p priority 5. Therefore, enable PFC for 802.1p priority 5.

PFC must be enabled on all interfaces along the RDMA packet forwarding path. Therefore, enable PFC on all interfaces connecting spine devices and leaf devices and interfaces connecting leaf devices to servers.

·     ECN provides end-to-end congestion control. After congestion is detected on the device, the ECN field is marked for related packets. When the receiver receives packets with the ECN field marked, the receiver sends the congestion notification packets to the sender to notify the sender to decrease the rate of sending packets. In this example, enable ECN on all interfaces connecting spine devices and leaf devices and interfaces connecting leaf devices to servers.

For ECN to take effect first, make sure the high-limit value specified in the queue queue-id [ drop-level drop-level ] low-limit low-limit high-limit high-limit [ discard-probability discard-prob ] command is smaller than the PFC back pressure frame triggering threshold.

Restrictions and guidelines

For PFC to operate normally on an interface, follow these guidelines when configuring PFC:

·     Make sure the interface is configured to trust the 802.1p priority or DSCP value carried in packets by using the qos trust { dot1p | dscp } command.

·     Make sure the 802.1p-to-local priority map and DSCP-to-802.1p priority map are the same on all interfaces along the forwarding path.

On two member devices in the same leaf device group, the VLAN interfaces acting as the gateways must be configured with the same IP address and MAC address.

Procedures

Assigning IP addresses to interfaces

1.     Configure Leaf 1-1:

# Set the link mode of HundredGigE 1/0/25 through HundredGigE 1/0/28 to Layer 3, and assign an IP address to each interface.

<Leaf 1-1> system-view

[Leaf 1-1] interface hundredgige 1/0/25

[Leaf 1-1-HundredGigE1/0/25] port link-mode route

[Leaf 1-1-HundredGigE1/0/25] ip address 9.179.56.1 31

[Leaf 1-1-HundredGigE1/0/25] quit

[Leaf 1-1] interface hundredgige 1/0/26

[Leaf 1-1-HundredGigE1/0/26] port link-mode route

[Leaf 1-1-HundredGigE1/0/26] ip address 9.179.56.193 31

[Leaf 1-1-HundredGigE1/0/26] quit

[Leaf 1-1] interface hundredgige 1/0/27

[Leaf 1-1-HundredGigE1/0/27] port link-mode route

[Leaf 1-1-HundredGigE1/0/27] ip address 9.179.57.129 31

[Leaf 1-1-HundredGigE1/0/27] quit

[Leaf 1-1] interface hundredgige 1/0/28

[Leaf 1-1-HundredGigE1/0/28] port link-mode route

[Leaf 1-1-HundredGigE1/0/28] ip address 9.179.58.65 31

[Leaf 1-1-HundredGigE1/0/28] quit

# Create VLAN 2. Assign an IP address and MAC address to VLAN-interface 2.

[Leaf 1-1] vlan 2

[Leaf 1-1] interface vlan-interface 2

[Leaf 1-1-Vlan-interface2] ip address 9.179.64.1 26

[Leaf 1-1-Vlan-interface2] mac-address 0000-5e00-0001

2.     Configure Leaf 1-2, Leaf 2-1, and Leaf 2-2:

Set the link mode of each device interface to Layer 3, and assign an IP address to each interface as shown in Figure 13. (Details not shown.)

NOTE: VLAN-interface 2 on Leaf 1-2 and VLAN-interface 2 on Leaf 1-1 must be assigned the same IP address and MAC address. This rule also applies to Leaf 2-1 and Leaf 2-2.

3.     Configure Spine 1:

# Set the link mode of HundredGigE 1/1/1 through HundredGigE 1/1/4 to Layer 3, and assign an IP address to each interface.

<Spine 1> system-view

[Spine 1] interface hundredgige 1/1/1

[Spine 1-HundredGigE1/1/1] port link-mode route

[Spine 1-HundredGigE1/1/1] ip address 9.179.56.0 31

[Spine 1-HundredGigE1/1/1] quit

[Spine 1] interface hundredgige 1/1/2

[Spine 1-HundredGigE1/1/2] port link-mode route

[Spine 1-HundredGigE1/1/2] ip address 9.179.56.2 31

[Spine 1-HundredGigE1/1/2] quit

[Spine 1] interface hundredgige 1/1/3

[Spine 1-HundredGigE1/1/3] port link-mode route

[Spine 1-HundredGigE1/1/3] ip address 9.179.56.4 31

[Spine 1-HundredGigE1/1/3] quit

[Spine 1] interface hundredgige 1/1/4

[Spine 1-HundredGigE1/1/4] port link-mode route

[Spine 1-HundredGigE1/1/4] ip address 9.179.56.6 31

[Spine 1-HundredGigE1/1/4] quit

4.     Configure Spine 2, Spine 3, and Spine 4:

Configure these devices in the same way Spine 1 is configured. Assign IP addresses to interfaces as shown in Figure 13. (Details not shown.)

Configuring S-MLAG on leaf devices

1.     Configure Leaf 1-1:

# Set the LACP system MAC address to 0001-0001-0001. All S-MLAG devices in the same group must use the same LACP system MAC address.

<Leaf 1-1> system-view

[Leaf 1-1] lacp system-mac 1-1-1

# Set the LACP system priority to 123. All S-MLAG devices in the same group must use the same LACP system priority.

[Leaf 1-1] lacp system-priority 123

# Set the LACP system number to 1. You must assign a unique LACP system number to each S-MLAG device in the same group.

[Leaf 1-1] lacp system-number 1

# Create Layer 2 aggregate interface Bridge-Aggregation 1 and set its aggregation mode to dynamic.

[Leaf 1-1] interface bridge-aggregation 1

[Leaf 1-1-Bridge-Aggregation1] link-aggregation mode dynamic

[Leaf 1-1-Bridge-Aggregation1] quit

# Add Twenty-FiveGigE 1/0/1 to aggregation group 1. Enable the short LACP timeout interval (3 seconds) on the interface.

[Leaf 1-1] interface twenty-fivegige 1/0/1

[Leaf 1-1-Twenty-FiveGigE1/0/1] port link-aggregation group 1

[Leaf 1-1-Twenty-FiveGigE1/0/1] lacp period short

[Leaf 1-1-Twenty-FiveGigE1/0/1] quit

# Set the link type of Layer 2 aggregate interface 1 to trunk. Assign it to VLAN 2, remove it from VLAN 1, and configure its PVID as VLAN 2.

[Leaf 1-1] interface bridge-aggregation 1

[Leaf 1-1-Bridge-Aggregation1] port link-type trunk

[Leaf 1-1-Bridge-Aggregation1] undo port trunk permit vlan 1

[Leaf 1-1-Bridge-Aggregation1] port trunk permit vlan 2

[Leaf 1-1-Bridge-Aggregation1] port trunk pvid vlan 2

# Assign Layer 2 aggregate interface 1 to S-MLAG group 1.

[Leaf 1-1-Bridge-Aggregation1] port s-mlag group 1

[Leaf 1-1-Bridge-Aggregation1] quit

# Create Layer 2 aggregate interface Bridge-Aggregation 2 and set its aggregation mode to dynamic.

[Leaf 1-1] interface bridge-aggregation 2

[Leaf 1-1-Bridge-Aggregation2] link-aggregation mode dynamic

[Leaf 1-1-Bridge-Aggregation2] quit

# Assign Twenty-FiveGigE 1/0/2 to aggregation group 2. Enable the short LACP timeout interval (3 seconds) on the interface.

[Leaf 1-1] interface twenty-fivegige 1/0/2

[Leaf 1-1-Twenty-FiveGigE1/0/2] port link-aggregation group 2

[Leaf 1-1-Twenty-FiveGigE1/0/2] lacp period short

[Leaf 1-1-Twenty-FiveGigE1/0/2] quit

# Set the link type of Layer 2 aggregate interface 2 to trunk. Assign it to VLAN 2, remove it from VLAN 1, and configure its PVID as VLAN 2.

[Leaf 1-1] interface bridge-aggregation 2

[Leaf 1-1-Bridge-Aggregation2] port link-type trunk

[Leaf 1-1-Bridge-Aggregation2] undo port trunk permit vlan 1

[Leaf 1-1-Bridge-Aggregation2] port trunk permit vlan 2

[Leaf 1-1-Bridge-Aggregation2] port trunk pvid vlan 2

# Assign Layer 2 aggregate interface 2 to S-MLAG group 2.

[Leaf 1-1-Bridge-Aggregation2] port s-mlag group 2

[Leaf 1-1-Bridge-Aggregation2] quit

2.     Configure Leaf 1-2:

# Set the LACP system MAC address to 0001-0001-0001. All S-MLAG devices in the same group must use the same LACP system MAC address.

<Leaf 1-2> system-view

[Leaf 1-2] lacp system-mac 1-1-1

# Set the LACP system priority to 123. All S-MLAG devices in the same group must use the same LACP system priority.

[Leaf 1-2] lacp system-priority 123

# Set the LACP system number to 2. You must assign a unique LACP system number to each S-MLAG device in the same group.

[Leaf 1-2] lacp system-number 2

# Create Layer 2 aggregate interface Bridge-Aggregation 1 and set its aggregation mode to dynamic.

[Leaf 1-2] interface bridge-aggregation 1

[Leaf 1-2-Bridge-Aggregation1] link-aggregation mode dynamic

[Leaf 1-2-Bridge-Aggregation1] quit

# Add Twenty-FiveGigE 1/0/1 to aggregation group 1. Enable the short LACP timeout interval (3 seconds) on the interface.

[Leaf 1-2] interface twenty-fivegige 1/0/1

[Leaf 1-2-Twenty-FiveGigE1/0/1] port link-aggregation group 1

[Leaf 1-2-Twenty-FiveGigE1/0/1] lacp period short

[Leaf 1-2-Twenty-FiveGigE1/0/1] quit

# Set the link type of Layer 2 aggregate interface 1 to trunk. Assign it to VLAN 2, remove it from VLAN 1, and configure its PVID as VLAN 2.

[Leaf 1-2] interface bridge-aggregation 1

[Leaf 1-2-Bridge-Aggregation1] port link-type trunk

[Leaf 1-2-Bridge-Aggregation1] undo port trunk permit vlan 1

[Leaf 1-2-Bridge-Aggregation1] port trunk permit vlan 2

[Leaf 1-2-Bridge-Aggregation1] port trunk pvid vlan 2

# Assign Layer 2 aggregate interface 1 to S-MLAG group 1.

[Leaf 1-2-Bridge-Aggregation1] port s-mlag group 1

[Leaf 1-2-Bridge-Aggregation1] quit

# Create Layer 2 aggregate interface Bridge-Aggregation 2 and set its aggregation mode to dynamic.

[Leaf 1-2] interface bridge-aggregation 2

[Leaf 1-2-Bridge-Aggregation2] link-aggregation mode dynamic

[Leaf 1-2-Bridge-Aggregation2] quit

# Assign Twenty-FiveGigE 1/0/2 to aggregation group 2. Enable the short LACP timeout interval (3 seconds) on the interface.

[Leaf 1-2] interface twenty-fivegige 1/0/2

[Leaf 1-2-Twenty-FiveGigE1/0/2] port link-aggregation group 2

[Leaf 1-2-Twenty-FiveGigE1/0/2] lacp period short

[Leaf 1-2-Twenty-FiveGigE1/0/2] quit

# Set the link type of Layer 2 aggregate interface 2 to trunk. Assign it to VLAN 2, remove it from VLAN 1, and configure its PVID as VLAN 2.

[Leaf 1-2] interface bridge-aggregation 2

[Leaf 1-2-Bridge-Aggregation2] port link-type trunk

[Leaf 1-2-Bridge-Aggregation2] undo port trunk permit vlan 1

[Leaf 1-2-Bridge-Aggregation2] port trunk permit vlan 2

[Leaf 1-2-Bridge-Aggregation2] port trunk pvid vlan 2

# Assign Layer 2 aggregate interface 2 to S-MLAG group 2.

[Leaf 1-2-Bridge-Aggregation2] port s-mlag group 2

[Leaf 1-2-Bridge-Aggregation2] quit

3.     Configure Leaf 2-1 and Leaf 2-2:

Configure Leaf 2-1 and Leaf 2-2 in the same way Leaf 1-1 and Leaf 1-2 are configured. (Details not shown.)

Enabling local proxy ARP and broadcast/multicast/unknown unicast storm suppression on user-facing interfaces on leaf devices

1.     Configure Leaf 1-1:

# Configure broadcast/multicast/unknown unicast storm suppression on Twenty-FiveGigE 1/0/1 and Twenty-FiveGigE 1/0/2. These features prevent the interfaces from forwarding excessive flooded traffic and avoid impacting traffic forwarding.

[Leaf 1-1] interface range twenty-fivegige 1/0/1 twenty-fivegige 1/0/2

[Leaf 1-1-if-range] broadcast-suppression pps 0

[Leaf 1-1-if-range] multicast-suppression pps 0

[Leaf 1-1-if-range] unicast-suppression pps 0

# Enable ARP packet rate limit on Twenty-FiveGigE 1/0/1 and Twenty-FiveGigE 1/0/2, and set the maximum ARP packet rate to 50 pps. This feature prevents ARP attacks.

[Leaf 1-1-if-range] arp rate-limit 50

[Leaf 1-1-if-range] quit

# Enable local proxy ARP for VLAN-interface 2. Then, the hosts on the same subnet can communicate.

[Leaf 1-1] interface vlan-interface 2

[Leaf 1-1-Vlan-interface2] local-proxy-arp enable

#  Enable ARP direct route advertisement on VLAN-interface 2.

[Leaf 1-1-Vlan-interface2] arp route-direct advertise

2.     Configure Leaf 1-2, Leaf 2-1, and Leaf 2-2:

Configure these devices in the same way Leaf 1-1 is configured. (Details not shown.)

Configuring global QoS policies on each leaf device, and forbidding Layer 2 communication among servers on the same subnet and connected to the same leaf device group

1.     Configure Leaf 1-1:

# Configure ACL 4001 to match ARP packets. Create traffic class Arp-Forwarding. Configure the traffic class to match packets with source MAC address 0000-5e00-0001 (MAC address of VLAN-interface 2) and matching ACL 4001, namely, ARP packets sent by VLAN-interface 2.

[Leaf 1-1] acl mac 4001

[Leaf 1-1-acl-mac-4001] rule 0 permit type 0806 ffff

[Leaf 1-1-acl-mac-4001] quit

[Leaf 1-1] traffic classifier Arp-Forwarding operator and

[Leaf 1-1-classifier-Arp-Forwarding] if-match source-mac 0000-5e00-0001

[Leaf 1-1-classifier-Arp-Forwarding] if-match acl mac 4001

[Leaf 1-1-classifier-Arp-Forwarding] quit

# Create traffic class Arp-No-Forwarding, and use ACL 4001 as a match criterion.

[Leaf 1-1] traffic classifier Arp-No-Forwarding operator and

[Leaf 1-1-classifier-Arp-No-Forwarding] if-match acl 4001

[Leaf 1-1-classifier-Arp-No-Forwarding] quit

# Create traffic behavior Forwarding, and configure an action to forward matching packets.

[Leaf 1-1] traffic behavior Forwarding

[Leaf 1-1-behavior-Forwarding] filter permit

[Leaf 1-1-behavior-Forwarding] quit

# Create traffic behavior No-Forwarding, and configure an action to drop matching packets.

[Leaf 1-1] traffic behavior No-Forwarding

[Leaf 1-1-behavior-No-Forwarding] filter deny

[Leaf 1-1-behavior-No-Forwarding] quit

# Create QoS policy Arp-Filter. In the QoS policy, associate traffic class Arp-Forwarding with traffic behavior Forwarding, and associate traffic class Arp-No-Forwarding with traffic behavior No-Forwarding.

[Leaf 1-1] qos policy Arp-Filter

[Leaf 1-1-qospolicy-Arp-Filter] classifier Arp-Forwarding behavior Forwarding

[Leaf 1-1-qospolicy-Arp-Filter] classifier Arp-No-Forwarding behavior No-Forwarding

[Leaf 1-1-qospolicy-Arp-Filter] quit

# Apply QoS policy Arp-Filter to the outbound direction globally.

[Leaf 1-1] qos apply policy Arp-Filter global outbound

2.     Configure Leaf 1-2, Leaf 2-1, and Leaf 2-2:

Configure Leaf 1-2, Leaf 2-1, and Leaf 2-2 in the same way Leaf 1-1 is configured. (Details not shown.)

Configuring monitor link on leaf devices

1.     Configure Leaf 1-1:

# Create monitor link group 1, and set the switchover delay for the downlink interfaces in the monitor link group to 120 seconds.

<Leaf 1-1> system-view

[Leaf 1-1] monitor-link group 1

[Leaf 1-1-mtlk-group1] downlink up-delay 120

[Leaf 1-1-mtlk-group1] quit

# Configure all interfaces connecting to spine devices as uplink interfaces, and configure all interfaces connecting to the servers as downlink interfaces. Then, when all uplink interfaces on the leaf device go down, the downlink interfaces connecting to the servers also go down, and traffic is switched to the aggregation member ports on the other leaf device in the group.

[Leaf 1-1] interface hundredgige 1/0/25

[Leaf 1-1-HundredGigE1/0/25] port monitor-link group 1 uplink

[Leaf 1-1-HundredGigE1/0/25] quit

[Leaf 1-1] interface hundredgige 1/0/26

[Leaf 1-1-HundredGigE1/0/26] port monitor-link group 1 uplink

[Leaf 1-1-HundredGigE1/0/26] quit

[Leaf 1-1] interface hundredgige 1/0/27

[Leaf 1-1-HundredGigE1/0/27] port monitor-link group 1 uplink

[Leaf 1-1-HundredGigE1/0/27] quit

[Leaf 1-1] interface hundredgige 1/0/28

[Leaf 1-1-HundredGigE1/0/28] port monitor-link group 1 uplink

[Leaf 1-1-HundredGigE1/0/28] quit

[Leaf 1-1] interface twenty-fivegige 1/0/1

[Leaf 1-1-Twenty-FiveGigE1/0/1] port monitor-link group 1 downlink

[Leaf 1-1-Twenty-FiveGigE1/0/1] quit

[Leaf 1-1] interface twenty-fivegige 1/0/2

[Leaf 1-1-Twenty-FiveGigE1/0/2] port monitor-link group 1 downlink

[Leaf 1-1-Twenty-FiveGigE1/0/2] quit

2.     Configure Leaf 1-2, Leaf 2-1, and Leaf 2-2:

Configure Leaf 1-2, Leaf 2-1, and Leaf 2-2 in the same way Leaf 1-1 is configured. (Details not shown.)

Configuring BGP

1.     Configure Leaf 1-1:

# Enable BGP instance default, set the local AS number to 5001, and enter BGP instance view.

<Leaf 1-1> system-view

[Leaf 1-1] bgp 5001

# Set the global router ID for BGP to 1.1.1.1.

[Leaf 1-1-bgp-default] router-id 1.1.1.1

# Enable MED comparison for routes from peers in different ASs.

[Leaf 1-1-bgp-default] compare-different-as-med

# Create EBGP peer group Core.

[Leaf 1-1-bgp-default] group Core external

# Specify AS number 805 for peer group Core.

[Leaf 1-1-bgp-default] peer Core as-number 805

# Set the interval to 0 for sending the same update to peer group Core.

[Leaf 1-1-bgp-default] peer Core route-update-interval 0

# Add the spine devices to peer group Core.

[Leaf 1-1-bgp-default] peer 9.179.56.0 group Core

[Leaf 1-1-bgp-default] peer 9.179.56.192 group Core

[Leaf 1-1-bgp-default] peer 9.179.57.128 group Core

[Leaf 1-1-bgp-default] peer 9.179.58.64 group Core

# Enter BGP IPv4 unicast address family view.

[Leaf 1-1-bgp-default] address-family ipv4 unicast

# Enable load balancing and set the maximum number of BGP ECMP routes for load balancing to 32.

[Leaf 1-1-bgp-default-ipv4] balance 32

# In BGP IPv4 unicast address family view, set preferences for EBGP, IBGP, and local routes to 10, 100, and 100, respectively.

[Leaf 1-1-bgp-default-ipv4] preference 10 100 100

# Advertise the COMMUNITY attribute to peer group Core.

[Leaf 1-1-bgp-default-ipv4] peer Core advertise-community

[Leaf 1-1-bgp-default-ipv4] quit

# In BGP IPv4 unicast address family view, enable BGP to exchange IPv4 unicast routing information with peer group Core.

[Leaf 1-1-bgp-default-ipv4] peer Core enable

[Leaf 1-1-bgp-default-ipv4] quit

[Leaf 1-1-bgp-default] quit

# Create routing policy SetComm-1/1, and set the COMMUNITY attribute for BGP routes to 1:1.

[Leaf 1-1] route-policy SetComm-1/1 permit node 5

[Leaf 1-1-route-policy-SetComm-1/1-5] apply community 1:1

[Leaf 1-1-route-policy-SetComm-1/1-5] quit

# In BGP IPv4 unicast address family view, inject local network 9.179.64.0/26 to the BGP routing table, and apply routing policy SetComm-1/1.

[Leaf 1-1] bgp 5001

[Leaf 1-1-bgp-default] address-family ipv4 unicast

[Leaf 1-1-bgp-default-ipv4] network 9.179.64.0 255.255.255.192 route-policy SetComm-1/1

[Leaf 1-1-bgp-default-ipv4] quit

[Leaf 1-1-bgp-default] quit

# Configure IPv4 prefix list Host-Route to permit routes with mask length 32 (host routes).

[Leaf 1-1] ip prefix-list Host-Route index 5 permit 0.0.0.0 0 greater-equal 32

# Create routing policy From-Direct. Configure the routing policy to set the COMMUNITY attribute to 1:5001 for BGP routes matching IPv4 prefix list Host-Route.

[Leaf 1-1] route-policy From-Direct permit node 10

[Leaf 1-1-route-policy-From-Direct-10] if-match ip address prefix-list Host-Route

[Leaf 1-1-route-policy-From-Direct-10] apply community 1:5001

[Leaf 1-1-route-policy-From-Direct-10] quit

# Configure basic community list 11 to permit routes with the COMMUNITY attribute 1:1 or 1:5002.

[Leaf 1-1] ip community-list 11 permit 1:1

[Leaf 1-1] ip community-list 11 permit 1:5002

# Create routing policy From-Core. Configure the routing policy to match BGP routes whose COMMUNITY attribute matches basic community list 11.

[Leaf 1-1] route-policy From-Core permit node 5

[Leaf 1-1-route-policy-From-Core-5] if-match community 11

[Leaf 1-1-route-policy-From-Core-5] quit

# Redistribute direct routes matching routing policy From-Direct (direct routes with the COMMUNITY attribute 1:5001).

[Leaf 1-1] bgp 5001

[Leaf 1-1-bgp-default] address-family ipv4 unicast

[Leaf 1-1-bgp-default-ipv4] import-route direct route-policy From-Direct

# In BGP IPv4 unicast address family view, apply routing policy From-Core to routes incoming from peer group Core (receive the service subnet routes advertised by spine devices and host routes advertised by the other leaf device in the same group).

[Leaf 1-1-bgp-default-ipv4] peer Core route-policy From-Core import

[Leaf 1-1-bgp-default-ipv4] quit

[Leaf 1-1-bgp-default] quit
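
The prefix list, community lists, and routing policies configured above for Leaf 1-1 work together as follows: redistributed host routes are tagged with community 1:5001, and only routes carrying community 1:1 (service subnets from the spine devices) or 1:5002 (host routes from Leaf 1-2) are accepted from peer group Core. The following Python sketch is a simplified, hypothetical model of this policy logic (it is not device code, and the sample addresses are illustrative):

from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class Route:
    prefix: str
    masklen: int
    communities: Set[str] = field(default_factory=set)

def from_direct(route: Route) -> Optional[Route]:
    # Model of routing policy From-Direct: prefix list Host-Route matches
    # routes with mask length 32, which then get community 1:5001 applied.
    if route.masklen == 32:
        route.communities.add("1:5001")
        return route
    return None   # other direct routes are not redistributed into BGP

def from_core(route: Route) -> bool:
    # Model of routing policy From-Core: accept only routes whose COMMUNITY
    # attribute matches basic community list 11 (1:1 or 1:5002).
    return bool(route.communities & {"1:1", "1:5002"})

# A host route for a directly connected server is tagged on redistribution.
host = from_direct(Route("9.179.64.10", 32))
print(host.communities if host else "not redistributed")   # {'1:5001'}

# A service subnet from a spine (community 1:1) is accepted from peer group
# Core; an untagged route is filtered out.
print(from_core(Route("9.179.65.0", 26, {"1:1"})))   # True
print(from_core(Route("10.0.0.0", 24)))              # False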

2.     Configure Leaf 1-2:

# Enable BGP instance default, set the local AS number to 5002, and enter BGP instance view.

<Leaf 1-2> system-view

[Leaf 1-2] bgp 5002

# Set the global router ID for BGP to 2.2.2.2.

[Leaf 1-2-bgp-default] router-id 2.2.2.2

# Enable MED comparison for routes from peers in different ASs.

[Leaf 1-2-bgp-default] compare-different-as-med

# Create EBGP peer group Core.

[Leaf 1-2-bgp-default] group Core external

# Specify AS number 805 for peer group Core.

[Leaf 1-2-bgp-default] peer Core as-number 805

# Set the interval to 0 for sending the same update to peer group Core.

[Leaf 1-2-bgp-default] peer Core route-update-interval 0

# Add the spine devices to peer group Core.

[Leaf 1-2-bgp-default] peer 9.179.56.2 group Core

[Leaf 1-2-bgp-default] peer 9.179.56.194 group Core

[Leaf 1-2-bgp-default] peer 9.179.57.130 group Core

[Leaf 1-2-bgp-default] peer 9.179.58.66 group Core

# Enter BGP IPv4 unicast address family view.

[Leaf 1-2-bgp-default] address-family ipv4 unicast

# Enable load balancing and set the maximum number of BGP ECMP routes for load balancing to 32.

[Leaf 1-2-bgp-default-ipv4] balance 32

# In BGP IPv4 unicast address family view, set preferences for EBGP, IBGP, and local routes to 10, 100, and 100, respectively.

[Leaf 1-2-bgp-default-ipv4] preference 10 100 100

# Advertise the COMMUNITY attribute to peer group Core.

[Leaf 1-2-bgp-default-ipv4] peer Core advertise-community

[Leaf 1-2-bgp-default-ipv4] quit

# In BGP IPv4 unicast address family view, enable BGP to exchange IPv4 unicast routing information with peer group Core.

[Leaf 1-2-bgp-default-ipv4] peer Core enable

[Leaf 1-2-bgp-default-ipv4] quit

[Leaf 1-2-bgp-default] quit

# Create routing policy SetComm-1/1, and set the COMMUNITY attribute for BGP routes to 1:1.

[Leaf 1-2] route-policy SetComm-1/1 permit node 5

[Leaf 1-2-route-policy-SetComm-1/1-5] apply community 1:1

[Leaf 1-2-route-policy-SetComm-1/1-5] quit

# In BGP IPv4 unicast address family view, inject local network 9.179.64.0/26 into the BGP routing table, and apply routing policy SetComm-1/1.

[Leaf 1-2] bgp 5002

[Leaf 1-2-bgp-default] address-family ipv4 unicast

[Leaf 1-2-bgp-default-ipv4] network 9.179.64.0 255.255.255.192 route-policy SetComm-1/1

[Leaf 1-2-bgp-default-ipv4] quit

[Leaf 1-2-bgp-default] quit

# Configure IPv4 prefix list Host-Route to permit routes with mask length 32 (host routes).

[Leaf 1-2] ip prefix-list Host-Route index 5 permit 0.0.0.0 0 greater-equal 32

# Create routing policy From-Direct. Configure the routing policy to set the COMMUNITY attribute to 1:5002 for BGP routes matching IPv4 prefix list Host-Route.

[Leaf 1-2] route-policy From-Direct permit node 10

[Leaf 1-2-route-policy-From-Direct-10] if-match ip address prefix-list Host-Route

[Leaf 1-2-route-policy-From-Direct-10] apply community 1:5002

[Leaf 1-2-route-policy-From-Direct-10] quit

# Configure basic community list 11 to permit routes with the COMMUNITY attribute 1:1 or 1:5001.

[Leaf 1-2] ip community-list 11 permit 1:1

[Leaf 1-2] ip community-list 11 permit 1:5001

# Create routing policy From-Core. Configure the routing policy to match BGP routes whose COMMUNITY attribute matches basic community list 11.

[Leaf 1-2] route-policy From-Core permit node 5

[Leaf 1-2-route-policy-From-Core-5] if-match community 11

[Leaf 1-2-route-policy-From-Core-5] quit

# Redistribute direct routes matching routing policy From-Direct (host routes, which are tagged with the COMMUNITY attribute 1:5002).

[Leaf 1-2] bgp 5002

[Leaf 1-2-bgp-default] address-family ipv4 unicast

[Leaf 1-2-bgp-default-ipv4] import-route direct route-policy From-Direct

# In BGP IPv4 unicast address family view, apply routing policy From-Core to routes received from peer group Core (to accept the service subnet routes advertised by the spine devices and the host routes advertised by the other leaf device in the same group).

[Leaf 1-2-bgp-default-ipv4] peer Core route-policy From-Core import

[Leaf 1-2-bgp-default-ipv4] quit

[Leaf 1-2-bgp-default] quit

3.     Configure Leaf 2-1 and Leaf 2-2:

Configure Leaf 2-1 and Leaf 2-2 in the same way Leaf 1-1 and Leaf 1-2 are configured. (Details not shown.)

4.     Configure Spine 1:

# Enable BGP instance default, set the local AS number to 805, and enter BGP instance view.

<Spine 1> system-view

[Spine 1] bgp 805

# Set the global router ID for BGP to 100.100.100.100.

[Spine 1-bgp-default] router-id 100.100.100.100

# Enable MED comparison for routes from peers in different ASs.

[Spine 1-bgp-default] compare-different-as-med

# Configure BGP to delay sending route updates after the device reboots, and set the delay to 300 seconds.

[Spine 1-bgp-default] bgp update-delay on-startup 300

# Create EBGP peer group LEAF25G.

[Spine 1-bgp-default] group LEAF25G external

# Set the interval to 0 for sending the same update to peer group LEAF25G.

[Spine 1-bgp-default] peer LEAF25G route-update-interval 0

# Add the leaf devices to peer group LEAF25G.

[Spine 1-bgp-default] peer 9.179.56.1 as-number 5001

[Spine 1-bgp-default] peer 9.179.56.1 group LEAF25G

[Spine 1-bgp-default] peer 9.179.56.3 as-number 5002

[Spine 1-bgp-default] peer 9.179.56.3 group LEAF25G

[Spine 1-bgp-default] peer 9.179.56.5 as-number 5003

[Spine 1-bgp-default] peer 9.179.56.5 group LEAF25G

[Spine 1-bgp-default] peer 9.179.56.7 as-number 5004

[Spine 1-bgp-default] peer 9.179.56.7 group LEAF25G

# Enter BGP IPv4 unicast address family view.

[Spine 1-bgp-default] address-family ipv4 unicast

# Enable load balancing and set the maximum number of BGP ECMP routes for load balancing to 32.

[Spine 1-bgp-default-ipv4] balance 32

# Enable load balancing for EBGP routes whose AS_PATH attributes have the same length but different content. After this command is executed, the host routes advertised by the two member devices in a leaf device group can form ECMP routes even though the two devices belong to different ASs. A sketch of this eligibility check is provided at the end of this configuration step.

[Spine 1-bgp-default-ipv4] balance as-path-relax

# In BGP IPv4 unicast address family view, set preferences for EBGP, IBGP, and local routes to 10, 100, and 100, respectively.

[Spine 1-bgp-default-ipv4] preference 10 100 100

# In BGP IPv4 unicast address family view, enable BGP to exchange IPv4 unicast routing information with peer group LEAF25G.

[Spine 1-bgp-default-ipv4] peer LEAF25G enable

[Spine 1-bgp-default-ipv4] quit

[Spine 1-bgp-default] quit
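
As noted above, balance as-path-relax lets EBGP routes whose AS_PATH attributes differ in content but not in length form ECMP routes, which is what allows the same host route advertised by Leaf 1-1 (AS 5001) and Leaf 1-2 (AS 5002) to be load balanced. The following Python sketch is a minimal model of that eligibility check only; it assumes all other BGP path attributes tie and is not device code:

from typing import List

def ecmp_eligible(as_path_a: List[int], as_path_b: List[int],
                  as_path_relax: bool) -> bool:
    # Simplified ECMP eligibility check for two EBGP paths; assumes all
    # other BGP attributes (preference, MED, origin, and so on) are equal.
    if as_path_relax:
        # With balance as-path-relax, only the AS_PATH lengths must match.
        return len(as_path_a) == len(as_path_b)
    # Without it, the AS_PATH attributes must be identical.
    return as_path_a == as_path_b

# The same host route received from Leaf 1-1 (AS 5001) and Leaf 1-2 (AS 5002):
print(ecmp_eligible([5001], [5002], as_path_relax=False))   # False: paths differ
print(ecmp_eligible([5001], [5002], as_path_relax=True))    # True: same length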

5.     Configure Spine 2, Spine 3, and Spine 4:

Configure Spine 2, Spine 3, and Spine 4 in the same way Spine 1 is configured. (Details not shown.)

Configuring ECN

1.     Configure Leaf 1-1:

# Configure both queue 5 and queue 6 to use up to 100% of the shared-area cell resources in the egress buffer. Queue 5 is used to forward RDMA application packets, and queue 6 is used to forward CNP packets.

[Leaf 1-1] buffer egress cell queue 5 shared ratio 100

[Leaf 1-1] buffer egress cell queue 6 shared ratio 100

[Leaf 1-1] buffer apply

# Create WRED table QOS-EGRESS-100G-PORT. Configure the lower limit as 1000, the upper limit as 2000, and the drop probability as 20% for queue 5. Configure the lower limit as 3000, the upper limit as 15000, and the drop probability as 20% for queues 0, 1, 2, 3, 4, and 6. Configure the lower limit as 37999, the upper limit as 38000, and the drop probability as 20% for queue 7. Set the exponent for WRED to calculate the average queue size to 0 for queue 5. Enable ECN for queues 5 and 6. A simplified sketch of the WRED/ECN marking behavior is provided at the end of this configuration step.

[Leaf 1-1] qos wred queue table QOS-EGRESS-100G-PORT

[Leaf 1-1-wred-table-QOS-EGRESS-100G-PORT] queue 0 low-limit 3000 high-limit 15000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-100G-PORT] queue 1 low-limit 3000 high-limit 15000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-100G-PORT] queue 2 low-limit 3000 high-limit 15000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-100G-PORT] queue 3 low-limit 3000 high-limit 15000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-100G-PORT] queue 4 low-limit 3000 high-limit 15000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-100G-PORT] queue 5 low-limit 1000 high-limit 2000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-100G-PORT] queue 6 low-limit 3000 high-limit 15000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-100G-PORT] queue 7 low-limit 37999 high-limit 38000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-100G-PORT] queue 5 weighting-constant 0

[Leaf 1-1-wred-table-QOS-EGRESS-100G-PORT] queue 5 ecn

[Leaf 1-1-wred-table-QOS-EGRESS-100G-PORT] queue 6 ecn

[Leaf 1-1-wred-table-QOS-EGRESS-100G-PORT] quit

# Create WRED table QOS-EGRESS-25G-PORT. Configure the lower limit as 400, the upper limit as 1625, and the drop probability as 20% for queue 5. Configure the lower limit as 2500, the upper limit as 10000, and the drop probability as 20% for queues 0, 1, 2, 3, 4, and 6. Configure the lower limit as 37999, the upper limit as 38000, and the drop probability as 20% for queue 7. Set the exponent for WRED to calculate the average queue size to 0 for queue 5. Enable ECN for queues 5 and 6.

[Leaf 1-1] qos wred queue table QOS-EGRESS-25G-PORT

[Leaf 1-1-wred-table-QOS-EGRESS-25G-PORT] queue 0 low-limit 2500 high-limit 10000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-25G-PORT] queue 1 low-limit 2500 high-limit 10000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-25G-PORT] queue 2 low-limit 2500 high-limit 10000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-25G-PORT] queue 3 low-limit 2500 high-limit 10000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-25G-PORT] queue 4 low-limit 2500 high-limit 10000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-25G-PORT] queue 5 low-limit 400 high-limit 1625 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-25G-PORT] queue 6 low-limit 2500 high-limit 10000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-25G-PORT] queue 7 low-limit 37999 high-limit 38000 discard-probability 20

[Leaf 1-1-wred-table-QOS-EGRESS-25G-PORT] queue 5 weighting-constant 0

[Leaf 1-1-wred-table-QOS-EGRESS-25G-PORT] queue 5 ecn

[Leaf 1-1-wred-table-QOS-EGRESS-25G-PORT] queue 6 ecn

[Leaf 1-1-wred-table-QOS-EGRESS-25G-PORT] quit

# Apply WRED table QOS-EGRESS-100G-PORT to interfaces HundredGigE 1/0/25 through HundredGigE 1/0/28. On these interfaces, enable byte-count WFQ, configure queue 5 with the scheduling weight of 60, and configure queues 6 and 7 to use SP queuing. Configure GTS for packets in queue 6. Set the CIR to 50000000 kbps and the CBS to 16000000 bytes.

[Leaf 1-1] interface range hundredgige 1/0/25 to hundredgige 1/0/28

[Leaf 1-1-if-range] qos wred apply QOS-EGRESS-100G-PORT

[Leaf 1-1-if-range] qos wfq byte-count

[Leaf 1-1-if-range] qos wfq ef group 1 byte-count 60

[Leaf 1-1-if-range] qos wfq cs6 group sp

[Leaf 1-1-if-range] qos wfq cs7 group sp

[Leaf 1-1-if-range] qos gts queue 6 cir 50000000 cbs 16000000

[Leaf 1-1-if-range] quit

# Apply WRED table QOS-EGRESS-25G-PORT to interfaces Twenty-FiveGigE 1/0/1 and Twenty-FiveGigE 1/0/2. On these interfaces, enable byte-count WFQ, configure queue 5 with the scheduling weight of 60, and configure queues 6 and 7 to use SP queuing. Configure GTS for packets in queue 6. Set the CIR to 12500000 kbps and the CBS to 16000000 bytes.

[Leaf 1-1] interface range twenty-fivegige 1/0/1 to twenty-fivegige 1/0/2

[Leaf 1-1-if-range] qos wred apply QOS-EGRESS-25G-PORT

[Leaf 1-1-if-range] qos wfq byte-count

[Leaf 1-1-if-range] qos wfq ef group 1 byte-count 60

[Leaf 1-1-if-range] qos wfq cs6 group sp

[Leaf 1-1-if-range] qos wfq cs7 group sp

[Leaf 1-1-if-range] qos gts queue 6 cir 12500000 cbs 16000000

[Leaf 1-1-if-range] quit
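
The WRED tables above pair per-queue lower and upper limits with ECN. The following Python sketch is a simplified model of how such thresholds are commonly applied (an assumed linear ramp between the limits; the switch hardware implements this internally and may differ in detail): below the lower limit packets are forwarded unchanged, between the limits packets are ECN-marked with a probability that grows toward the configured drop probability, and at or above the upper limit all packets are marked, or dropped when ECN is disabled.

import random

def wred_action(queue_len: int, low: int, high: int,
                max_prob: float, ecn: bool) -> str:
    # Return 'forward', 'mark' (ECN), or 'drop' for one arriving packet.
    # Queue lengths and limits are in cells.
    if queue_len < low:
        return "forward"
    if queue_len >= high:
        return "mark" if ecn else "drop"
    # Between the limits, the probability ramps linearly up to max_prob.
    prob = max_prob * (queue_len - low) / (high - low)
    if random.random() < prob:
        return "mark" if ecn else "drop"
    return "forward"

# Queue 5 on a 100-GE port: lower limit 1000, upper limit 2000, 20% probability.
for depth in (800, 1500, 2500):
    print(depth, wred_action(depth, low=1000, high=2000, max_prob=0.20, ecn=True))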

2.     Configure Leaf 1-2, Leaf 2-1, and Leaf 2-2:

Configure Leaf 1-2, Leaf 2-1, and Leaf 2-2 in the same way Leaf 1-1 is configured. (Details not shown.)

3.     Configure Spine 1:

# Configure both queue 5 and queue 6 to use up to 100% of the shared-area cell resources in the egress buffer.

[Spine 1] buffer egress cell queue 5 shared ratio 100

[Spine 1] buffer egress cell queue 6 shared ratio 100

[Spine 1] buffer apply

# Create WRED table QOS-EGRESS-100G-PORT. Configure the lower limit as 1000, the upper limit as 2000, and the drop probability as 20% for queue 5. Configure the lower limit as 3500, the upper limit as 20000, and the drop probability as 20% for queues 0, 1, 2, 3, 4, and 6. Configure the lower limit as 37999, the upper limit as 38000, and the drop probability as 20% for queue 7. Set the exponent for WRED to calculate the average queue size to 0 for queue 5. Enable ECN for queues 5 and 6.

[Spine 1] qos wred queue table QOS-EGRESS-100G-PORT

[Spine 1-wred-table-QOS-EGRESS-100G-PORT] queue 0 low-limit 3500 high-limit 20000 discard-probability 20

[Spine 1-wred-table-QOS-EGRESS-100G-PORT] queue 1 low-limit 3500 high-limit 20000 discard-probability 20

[Spine 1-wred-table-QOS-EGRESS-100G-PORT] queue 2 low-limit 3500 high-limit 20000 discard-probability 20

[Spine 1-wred-table-QOS-EGRESS-100G-PORT] queue 3 low-limit 3500 high-limit 20000 discard-probability 20

[Spine 1-wred-table-QOS-EGRESS-100G-PORT] queue 4 low-limit 3500 high-limit 20000 discard-probability 20

[Spine 1-wred-table-QOS-EGRESS-100G-PORT] queue 5 low-limit 1000 high-limit 2000 discard-probability 20

[Spine 1-wred-table-QOS-EGRESS-100G-PORT] queue 6 low-limit 3500 high-limit 20000 discard-probability 20

[Spine 1-wred-table-QOS-EGRESS-100G-PORT] queue 7 low-limit 37999 high-limit 38000 discard-probability 20

[Spine 1-wred-table-QOS-EGRESS-100G-PORT] queue 5 weighting-constant 0

[Spine 1-wred-table-QOS-EGRESS-100G-PORT] queue 5 ecn

[Spine 1-wred-table-QOS-EGRESS-100G-PORT] queue 6 ecn

[Spine 1-wred-table-QOS-EGRESS-100G-PORT] quit

# Apply WRED table QOS-EGRESS-100G-PORT to interfaces HundredGigE 1/1/1 through HundredGigE 1/1/4. On these interfaces, enable byte-count WFQ. Configure queue 5 with the scheduling weight of 60, and configure queues 6 and 7 to use SP queuing. Configure GTS for packets in queue 6. Set the CIR to 50000000 kbps and the CBS to 16000000 bytes. A token-bucket sketch of this shaping behavior is provided at the end of this configuration step.

[Spine 1] interface range hundredgige 1/1/1 to hundredgige 1/1/4

[Spine 1-if-range] qos wred apply QOS-EGRESS-100G-PORT

[Spine 1-if-range] qos wfq byte-count

[Spine 1-if-range] qos wfq ef group 1 byte-count 60

[Spine 1-if-range] qos wfq cs6 group sp

[Spine 1-if-range] qos wfq cs7 group sp

[Spine 1-if-range] qos gts queue 6 cir 50000000 cbs 16000000

[Spine 1-if-range] quit
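
The qos gts command above shapes queue 6 (CNP traffic) to a CIR of 50000000 kbps with a CBS of 16000000 bytes. The following Python token-bucket sketch illustrates the general shaping idea only; it is a simplified model under assumed single-rate token-bucket semantics, not the device algorithm:

class TokenBucket:
    # Simplified single-rate token bucket; CIR is in kbps and CBS in bytes,
    # matching the qos gts parameters above. Illustrative only.
    def __init__(self, cir_kbps: int, cbs_bytes: int):
        self.rate = cir_kbps * 1000 / 8.0   # token fill rate in bytes per second
        self.capacity = cbs_bytes           # bucket depth in bytes
        self.tokens = float(cbs_bytes)
        self.last = 0.0

    def conforms(self, pkt_bytes: int, now: float) -> bool:
        # Refill tokens based on elapsed time, then check the packet size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            return True
        return False   # non-conforming packets are buffered (shaped), not dropped

bucket = TokenBucket(cir_kbps=50000000, cbs_bytes=16000000)
print(bucket.conforms(64, now=0.0))   # True: the bucket starts full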

4.     Configure Spine 2, Spine 3, and Spine 4:

Configure Spine 2, Spine 3, and Spine 4 in the same way Spine 1 is configured. (Details not shown.)

Configuring PFC and PFC deadlock detection

1.     Configure Leaf 1-1:

# Configure the maximum number of cell resources in the headroom as 9928. Calculate the maximum number of cell resources available in the headroom according to the total headroom buffer thresholds configured for all interfaces. For example, if there are 8*100-GE interfaces (each 100-GE interface is configured with the headroom buffer threshold 491) and 48*25-GE interfaces (each 25-GE interface is configured with the headroom buffer threshold 125), you must configure the maximum number of cell resources in the headroom as 9928 (491*8+125*48=9928). A calculation sketch is provided at the end of this configuration step.

[Leaf 1-1] priority-flow-control poolID 0 headroom 9928

# Set the PFC deadlock detection interval for CoS 5 to 10, and set the precision of the PFC deadlock detection timer to high.

[Leaf 1-1] priority-flow-control deadlock cos 5 interval 10

[Leaf 1-1] priority-flow-control deadlock precision high

# On interfaces Twenty-FiveGigE 1/0/1, Twenty-FiveGigE 1/0/2, and HundredGigE 1/0/25 through HundredGigE 1/0/28, trust the DSCP values in packets, enable PFC, and enable PFC for 802.1p priority 5. Enable PFC deadlock detection. Set the link-up event suppression interval to 2 seconds.

[Leaf 1-1] interface range twenty-fivegige 1/0/1 to twenty-fivegige 1/0/2 hundredgige 1/0/25 to hundredgige 1/0/28

[Leaf 1-1-if-range] qos trust dscp

[Leaf 1-1-if-range] priority-flow-control enable

[Leaf 1-1-if-range] priority-flow-control no-drop dot1p 5

[Leaf 1-1-if-range] priority-flow-control deadlock enable

[Leaf 1-1-if-range] link-delay up 2

[Leaf 1-1-if-range] quit

# Configure PFC thresholds for interfaces HundredGigE 1/0/25 through HundredGigE 1/0/28.

[Leaf 1-1] interface range hundredgige 1/0/25 to hundredgige 1/0/28

[Leaf 1-1-if-range] priority-flow-control dot1p 5 headroom 491

[Leaf 1-1-if-range] priority-flow-control dot1p 5 reserved-buffer 17

[Leaf 1-1-if-range] priority-flow-control dot1p 5 ingress-buffer dynamic 5

[Leaf 1-1-if-range] priority-flow-control dot1p 5 ingress-threshold-offset 12

[Leaf 1-1-if-range] quit

# Configure PFC thresholds for interfaces Twenty-FiveGigE 1/0/1 and Twenty-FiveGigE 1/0/2.

[Leaf 1-1] interface range twenty-fivegige 1/0/1 to twenty-fivegige 1/0/2

[Leaf 1-1-if-range] priority-flow-control dot1p 5 headroom 125

[Leaf 1-1-if-range] priority-flow-control dot1p 5 reserved-buffer 17

[Leaf 1-1-if-range] priority-flow-control dot1p 5 ingress-buffer dynamic 5

[Leaf 1-1-if-range] priority-flow-control dot1p 5 ingress-threshold-offset 12

[Leaf 1-1-if-range] quit
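
The headroom pool size of 9928 cells used earlier in this step is simply the sum of the per-interface headroom thresholds. The following Python lines check that arithmetic, assuming the example port counts from the configuration comment (8 100-GE interfaces and 48 25-GE interfaces):

# Sum of per-interface headroom thresholds, in cells.
ports_100ge, headroom_100ge = 8, 491    # 8 x 100-GE interfaces, 491 cells each
ports_25ge, headroom_25ge = 48, 125     # 48 x 25-GE interfaces, 125 cells each

total_headroom = ports_100ge * headroom_100ge + ports_25ge * headroom_25ge
print(total_headroom)   # 9928, the value set with priority-flow-control poolID 0 headroom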

2.     Configure Leaf 1-2, Leaf 2-1, and Leaf 2-2:

Configure Leaf 1-2, Leaf 2-1, and Leaf 2-2 in the same way Leaf 1-1 is configured. (Details not shown.)

3.     Configure Spine 1:

# Configure the maximum number of cell resources in the headroom as 19640. Calculate the maximum number of cell resources available in the headroom according to the total headroom buffer thresholds configured for all interfaces. This example uses 19640.

[Spine 1] priority-flow-control poolID 0 headroom 19640

# Set the PFC deadlock detection interval for CoS 5 to 10, and set the precision of the PFC deadlock detection timer to high.

[Spine 1] priority-flow-control deadlock cos 5 interval 10

[Spine 1] priority-flow-control deadlock precision high

# On interfaces HundredGigE 1/1/1 through HundredGigE 1/1/4, trust the DSCP values in packets, enable PFC, and enable PFC for 802.1p priority 5. Enable PFC deadlock detection. Set the link-up event suppression interval to 2 seconds.

[Spine 1] interface range hundredgige 1/1/1 to hundredgige 1/1/4

[Spine 1-if-range] qos trust dscp

[Spine 1-if-range] priority-flow-control enable

[Spine 1-if-range] priority-flow-control no-drop dot1p 5

[Spine 1-if-range] priority-flow-control deadlock enable

[Spine 1-if-range] link-delay up 2

# Configure PFC thresholds for interfaces HundredGigE 1/1/1 through HundredGigE 1/1/4.

[Spine 1-if-range] priority-flow-control dot1p 5 headroom 491

[Spine 1-if-range] priority-flow-control dot1p 5 reserved-buffer 20

[Spine 1-if-range] priority-flow-control dot1p 5 ingress-buffer dynamic 5

[Spine 1-if-range] priority-flow-control dot1p 5 ingress-threshold-offset 12

[Spine 1-if-range] quit

4.     Configure Spine 2, Spine 3, and Spine 4:

Configure Spine 2, Spine 3, and Spine 4 in the same way Spine 1 is configured. (Details not shown.)

Verifying the configuration

Connect 67 servers to four spine devices and five groups of leaf devices (10 member devices in total). Then, verify the configuration using one of the following methods.

Method 1

·     Verification conditions: Servers connected to the same leaf device group send traffic to each other. The packet length is 512 bytes.

·     Verification results: The traffic speeds are as shown in Figure 14. No traffic is lost, the traffic speed is stable, and ECN takes effect throughout the process.

Figure 14 Servers connected to the same leaf device group send traffic to each other

 

Method 2

·     Verification conditions: Servers connected to a leaf device group send traffic to servers connected to the same leaf device group and to servers connected to other leaf device groups. The packet length is 256 bytes.

·     Verification results: The traffic speeds are as shown in Figure 15. No traffic is lost, the traffic speed is stable, and ECN takes effect throughout the process. PFC actions are triggered multiple times, but ECN generally takes effect before PFC.

Figure 15 Servers connected to a leaf device group send traffic to servers in the same group and in other groups

 

In summary, no RDMA traffic is lost during the whole process, the traffic speeds fluctuate within a normal range, and ECN takes effect before PFC to control the traffic speeds.

 
