SprayLink Technology White Paper-6W100

Released At: 12-11-2024

SprayLink Technology White Paper


Copyright © 2024 New H3C Technologies Co., Ltd. All rights reserved.

No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of New H3C Technologies Co., Ltd.

Except for the trademarks of New H3C Technologies Co., Ltd., any trademarks that may be mentioned in this document are the property of their respective owners.

This document provides generic technical information, some of which might not be applicable to your products.



Overview

With the development of data center network technologies, RDMA over Converged Ethernet (RoCE) has become a key component of modern data center network design. RoCE implements efficient remote direct memory access (RDMA) communication over Ethernet, significantly enhancing network transmission efficiency, reducing CPU load, and increasing the data center's capability to process large amounts of data.

In an RoCE network, elephant flows and mice flows might both exist.

·     Elephant flows—Typically the network has only a small number of elephant flows. Each elephant flow contains a large amount of data (such as service packets) and consumes high bandwidth.

·     Mice flows—Typically the network has large numbers of mice flows. Each mice flow contains a small amount of data (such as RoCE protocol packets) and is sensitive to packet loss.

When both elephant flows and mice flows exist, the RoCE network might face the following issues:

·     Uneven load sharing—Packets with the same attributes are always forwarded through the same path. As a result, elephant flows are forwarded on only one or a few links, leaving some links idle while others become congested.

·     Low bandwidth usage—With elephant flows concentrated on heavily loaded paths and mice flows on lightly loaded paths, bandwidth usage across links becomes uneven and overall link utilization stays low.

To address the previous issues, H3C introduced the SprayLink solution.

As a per-packet load sharing solution for edge network convergence, SprayLink uses flow-based load sharing for protocol packets and Spray per-packet load sharing for service packets in the network, enhancing bandwidth usage for network-wide links. In addition, SprayLink adjusts the sequence of disordered packets on the host side to ensure the correctness and integrity of data packets. The SprayLink solution effectively solves the issues of uneven load sharing and low bandwidth usage in RoCE networks.

Technology background

Introduction to ECMP load sharing

Equal-Cost Multi-Path routing (ECMP) is implemented based on multiple routes discovered by the same routing protocol, with the same destination address and cost. When no routes with a higher preference to the same destination are available, all these equal-cost routes are used. Packets to the destination are distributed across these paths to achieve network load sharing.

Figure 1 ECMP load sharing

 

Types of ECMP load sharing

Classification by load sharing method

Per-flow load sharing

Packets with the same 5-tuple information (source IP address, destination IP address, source port number, destination port number, and protocol number) belong to the same data flow. Per-flow load sharing forwards packets of the same flow through the same link.
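As a minimal illustration (a Python sketch, not device code), per-flow selection can be modeled by hashing the 5-tuple and mapping it to a link index. The CRC-based hash, the addresses, and the link count below are assumptions for illustration only:

```python
import zlib

def select_link_per_flow(src_ip, dst_ip, src_port, dst_port, proto, num_links):
    """Map a 5-tuple to a link index; packets of the same flow always map to the same link."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % num_links

# Two packets of the same flow always take the same link.
a = select_link_per_flow("10.0.0.1", "10.0.0.2", 1000, 4791, 17, 4)
b = select_link_per_flow("10.0.0.1", "10.0.0.2", 1000, 4791, 17, 4)
assert a == b
```

Because the link index depends only on the 5-tuple, a single elephant flow stays pinned to one link, which is exactly the behavior that causes the congestion issues described earlier.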

Figure 2 Per-flow load sharing

 

Per-packet load sharing

Per-packet load sharing selects a link for each packet in the order in which the packets arrive at the device, cycling through the links in a round-robin manner.
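The round-robin behavior can be sketched as follows (a simplified Python model, not device code; the link count is an example):

```python
import itertools

class PerPacketBalancer:
    """Round-robin link selector: each arriving packet goes to the next link in turn."""
    def __init__(self, num_links):
        self._cycle = itertools.cycle(range(num_links))

    def select_link(self):
        return next(self._cycle)

lb = PerPacketBalancer(3)
print([lb.select_link() for _ in range(6)])  # → [0, 1, 2, 0, 1, 2]
```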

Figure 3 Per-packet load sharing

 

Classification by load sharing criteria

Static load sharing

Static load sharing uses the packet fields (such as source MAC address, destination MAC address, and IP 5-tuple information) as hash calculation factors to determine the forwarding paths for the packets. Packets with the same fields will be forwarded through the same member link, and those with different fields might be forwarded through different member links.

Static load sharing performs link selection based on packet fields without considering member link usage. This might result in uneven load sharing among member links. When elephant flows occur, link selection based on packet fields might worsen congestion on specific links, resulting in increased latency and packet loss.

Dynamic load sharing

Dynamic load sharing optimizes load sharing from the time and bandwidth dimensions by introducing factors such as timestamps and real-time load metrics (port bandwidth usage and queue size). Devices that support dynamic load sharing continuously monitor the load status of links, and dynamically distribute packets to the paths with lighter loads to achieve load balancing.

Hash algorithm

Link load sharing is achieved through the hash algorithm. A hash algorithm, also known as a hash function, transforms input of any length into a fixed-length output called a hash value. This transformation is a type of compressed mapping, where the space of hash values is typically much smaller than the input space. Different inputs might hash to the same output, and the input value cannot be recovered from the output value.

Figure 4 and Figure 5 show the load sharing calculation process by using the hash algorithm.

1.     Select the packet fields (field selection) for hash calculation based on the Ethernet network type of the packet.

The values for field selection on the device are divided into Block A and Block B. Block A is used for aggregate interface (LAGs) load sharing, and Block B is used for ECMP load sharing.

2.     Field selection generates 13 hash bins, which include all the packet fields involved in hash calculation, as shown in Figure 5.

 

 

NOTE:

A hash bin is a unit for storing packet attribute data, typically in the form of an array or linked list.

 

3.     Hash bins calculate the hash result based on the configured load sharing hash algorithm. (LAGs: Hash A0 and Hash A1, ECMP: Hash B0 and Hash B1).

For traffic with certain characteristics, the higher bits of the hash result change only minimally, resulting in significant hash imbalance. You can change the hash calculation result by adjusting the hash algorithm, seed value, or shift value.

 

 

NOTE:

Hash imbalance: Only one or a few paths forward traffic, while other paths are assigned less traffic or no traffic at all.

 

4.     As shown in Figure 5, Hash A0 (16 bits), Hash A1 (16 bits), Hash B0 (16 bits), Hash B1 (16 bits), LBN (obtained through Ingress Port, 4 bits), Destination Port (7 bits), and LBID (Load Balance ID, 8 bits) are concatenated to form an 83-bit hash key. ECMP and LAGs select a subset from these 83 bits to calculate the output interface.

 

 

NOTE:

In this example, LAGs selects only Hash A1, and ECMP selects only Hash B1 for output interface calculation. All other fields are set to 0.

 

5.     Calculate the output interface by using the following offset formula:

Offset = ((hash value & 65535) % (flowset group size + 1)) & 0x3FF

The hash value & 65535 operation keeps only the lower 16 bits of the hash value, ensuring that it remains within a valid range. The & 0x3FF operation limits the offset to the range of 0 to 1023, ensuring that the output interface number is within a valid range.

 

 

NOTE:

·     The hash value is obtained through the XOR or CRC operation by using the hash key.

·     The flowset group size indicates the number of ECMP routes or the number of aggregation member ports.
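The offset formula in step 5 can be checked with a short sketch. The 16-bit hash value and the flowset group size below are example inputs for illustration, not values from a real device:

```python
def ecmp_offset(hash_value, flowset_group_size):
    # Offset = ((hash value & 65535) % (flowset group size + 1)) & 0x3FF
    return ((hash_value & 65535) % (flowset_group_size + 1)) & 0x3FF

# Example: 16-bit hash 0x1234 (4660) over a flowset group of size 8.
print(ecmp_offset(0x1234, 8))  # → 7, because 4660 % 9 = 7 and 7 & 0x3FF = 7
```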

 

Figure 4 Load sharing hash algorithm

 

Figure 5 Hash key calculation

 

SprayLink implementation

SprayLink enhances network performance and data transmission efficiency by implementing load sharing on the network side and packet reordering on the host side.

Network-side load sharing

Introduction

In an RoCE network, when both RoCE protocol packets and data packets use per-flow hashing, the limited variation in hash factors leads to increased hash imbalance, causing congestion on some links and higher latency. In addition, because RoCE protocol packets cannot be disordered, you must ensure that the hash results for protocol packets are consistent, so that RoCE protocol packets of the same flow are always forwarded through the same link.

To address the previous issue, the SprayLink solution introduced network-side load sharing to implement differentiated packet processing. That is, it performs flow-based hashing for protocol packets, and packet-based hashing for data packets. This minimizes the hash imbalance for data packets while ensuring protocol packet hash consistency.

Implementation mechanism

Spray hash algorithm

Traditional ECMP hash calculation does not consider the load of ECMP members when outputting link selection results. When traffic distribution among ECMP members is not as expected, the device cannot adjust the incorrect ECMP link selection result in time based on the load of ECMP members.

As a dynamic load sharing algorithm, spray hash incorporates bandwidth and queue depth into link selection in addition to the traditional ECMP hash calculation. It periodically monitors traffic to correct traffic imbalance issues.

·     Detection: Periodically measures traffic distribution among ECMP members. Measurement objects:

¡     Number of bytes sent by ECMP member ports.

¡     Number of units enqueued by ECMP members.

Spray hash smooths the preceding measurements for each ECMP member by using the Exponentially Weighted Moving Average (EWMA) technique, sorts the results, and then selects the optimal path based on the hash calculation result.

·     Correction: Moves traffic from ECMP members with heavier loads to ECMP members with lighter loads.
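The detection and correction steps above can be sketched as follows. This is an illustrative Python model only: the EWMA weight (alpha), the byte counts, and the lightest-load tie-breaking rule are assumptions, and a real device combines this load state with the hash calculation result rather than replacing it.

```python
class SprayBalancer:
    """Toy model of dynamic load sharing: EWMA of per-member load plus lightest-load selection."""
    def __init__(self, num_links, alpha=0.2):
        self.alpha = alpha
        self.load = [0.0] * num_links  # smoothed bytes sent per ECMP member

    def record(self, link, bytes_sent):
        # Detection: EWMA update, new = alpha * sample + (1 - alpha) * old.
        self.load[link] = self.alpha * bytes_sent + (1 - self.alpha) * self.load[link]

    def select_link(self):
        # Correction: steer the next packet to the most lightly loaded member.
        return min(range(len(self.load)), key=lambda i: self.load[i])

lb = SprayBalancer(3)
lb.record(0, 9000)  # member 0 is carrying an elephant flow
lb.record(1, 100)
print(lb.select_link())  # → 2, the member that has seen no traffic
```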

Network-side load sharing implementation

Device network interface cards (NICs) support automatic identification of RoCE protocol packets and data packets. After marking RoCE protocol packets as Reserved, device NICs can implement per-flow hashing for RoCE protocol packets and per-packet spray hashing for data packets by using the following solutions.

Solution 1

1.     Perform the ECMP hash calculation for all data flows and select the Spray load sharing mode. Use the spray hash algorithm to perform per-packet load sharing.

You can configure the spray load sharing mode globally or for a specific interface as needed.

#

ip load-sharing mode per-packet

#

interface interface-type interface-number

ip load-sharing mode per-packet spray  //Configure the spray load sharing mode for a specific interface

#

ecmp mode spray  //Configure the spray load sharing mode globally

#

 

2.     Issue a global, driver-customized load sharing ACL to match UDP packets with destination port number 4791 and a Reserved field of 0 (that is, RoCE protocol packets), and perform per-flow load sharing for matching packets. Packets that do not match the load sharing ACL are still forwarded through per-packet load sharing.

#

ip load-sharing acl

#

 

Figure 6 Working mechanism of network-side load sharing (solution 1)

 

Solution 2

1.     Perform ECMP hash calculation for all data flows and use the default normal mode for per-flow load sharing (the normal mode uses the RTAG7 hash algorithm).

2.     Configure custom ACL ACL1 to match RoCE protocol packets with UDP destination port number 4791 and a Reserved field of 0, and permit matching packets to pass through.

Configure PBR policy PBR1 to perform per-flow load sharing configured in step 1 for packets matching ACL1.

#

policy-based-route pbr1 permit node 0

if-match acl acl1

#

 

3.     Configure custom ACL ACL2 (with lower priority than ACL1) to match all data packets, and permit matching packets to pass through. Configure PBR policy PBR2 to perform spray per-packet load sharing for packets matching ACL2.

#

policy-based-route pbr2 permit node 11

if-match acl acl2

apply next-hop ip-address <1-n>

apply loadshare next-hop

apply loadshare-mode next-hop spray

#

 

Figure 7 Working mechanism of network-side load sharing (solution 2)

 

 

NOTE:

Select one of the following policy-based routing (PBR) types as needed:

·     Interface PBR—Guides the forwarding of packets received on a specific interface.

·     Global PBR—Guides the forwarding of packets received on all interfaces on the device.

 

Solution comparison

As a best practice, use solution 1 to apply the SprayLink network-side load sharing solution to the packets received on all interfaces of the device, and use solution 2 to apply the solution only to the packets received on specific interfaces.

Host-side adjustment of disordered packets

SprayLink uses the dynamic per-packet load sharing mechanism in spray mode to forward packets. For each packet, the device selects the most lightly loaded path from the ECMP group for forwarding. As a result, data packets of the same data flow might take different forwarding paths, and the receiving host might receive out-of-order packets. The packet reordering function of the NIC on the receiving host reorders the packets, ensuring data integrity and accuracy.
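A minimal sketch of the reordering idea, assuming each packet carries a sequence number (the Python interface below is a hypothetical model, not the NIC's actual API):

```python
class Reorderer:
    """Buffer early arrivals and release payloads in sequence-number order."""
    def __init__(self):
        self.expected = 0
        self.buffer = {}  # seq -> payload for packets that arrived early

    def receive(self, seq, payload):
        """Return the payloads that become deliverable in order after this arrival."""
        self.buffer[seq] = payload
        out = []
        while self.expected in self.buffer:
            out.append(self.buffer.pop(self.expected))
            self.expected += 1
        return out

r = Reorderer()
print(r.receive(1, "B"))  # → [], packet 1 arrived early and is buffered
print(r.receive(0, "A"))  # → ['A', 'B'], packet 0 fills the gap
```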

Figure 8 Host-side packet reordering

 

Typical networking applications

As shown in Figure 9, multiple servers are typically deployed in the RoCE network of a data center and need to exchange large amounts of data. To improve data exchange efficiency and stability, you can use the SprayLink edge network convergence solution. The solution uses spray hash to perform per-packet load sharing for data packets and reorders disordered packets at the receiving end. In addition, the solution performs traditional per-flow load sharing for protocol packets (which cannot be disordered). Figure 9 illustrates the process for forwarding packets from service cluster 1 to service cluster 2. The packet forwarding process in the reverse direction is similar.

Figure 9 SprayLink application network diagram
