<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-yy-hpwan-transport-gap-analysis-00"
     ipr="trust200902">
  <front>
    <title abbrev="HP-WAN Transport Gap Analysis">Gap Analysis of
    Transport Protocols for High Performance Wide Area Networks</title>

    <author fullname="Kehan Yao" initials="K." surname="Yao">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>yaokehan@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Hongwei Yang" initials="H." surname="Yang">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>yanghongwei@chinamobile.com</email>
      </address>
    </author>

    <date day="15" month="October" year="2024"/>

    <area>Web and Internet Transport Area</area>

    <workgroup>High Performance Wide Area Network</workgroup>

    <keyword>Data movement</keyword>

    <keyword>RDMA</keyword>

    <abstract>
      <t>This document analyzes the throughput performance of existing
      transport protocols under different implementation modes, including
      kernel-space-based, user-space-based, and offloading-based
      implementations. It concludes that existing technologies are either
      limited by host CPU overhead or by the complexity of offloading, and
      cannot guarantee high throughput approaching the line rate of the
      network adapter. Accordingly, this document proposes new requirements
      for the design of HP-WAN transport protocols.</t>
    </abstract>

  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>HP-WAN needs to support massive data transmission over
      long-distance, lossy, and shared wide-area network infrastructure,
      with data transmission volumes typically in the terabyte (TB) to
      petabyte (PB) range; throughput is the key performance indicator. The
      design of transport protocols is critical, including rate control,
      congestion control, and multi-stream processing. Transport protocols
      such as TCP and QUIC can improve end-to-end data transmission
      throughput to some extent through optimizations of congestion control
      and multi-path transmission algorithms. However, they inevitably
      introduce excessive CPU overhead, so the actual throughput cannot
      approach the bandwidth of the endpoint network adapters, and
      operational costs increase.</t>

      <t>Super-fast Ethernet has become the industry trend: commercial
      products are available at 400 Gbps and are evolving towards 800 Gbps
      and Tbit-level bandwidth. Meanwhile, the performance growth rate of
      CPUs has slowed, and the gap between the two growth rates is becoming
      more obvious. The efficiency with which the endpoint CPU is used for
      data transmission is therefore increasingly important for improving
      endpoint throughput.</t>

      <t>This document analyzes the throughput performance of existing
      transport protocols under different implementation modes, including
      kernel-space-based, user-space-based, and offloading-based
      implementations. It concludes that existing technologies are either
      limited by host CPU overhead or by the complexity of offloading, and
      cannot guarantee high throughput approaching the line rate of the
      network adapter. Accordingly, this document proposes new requirements
      for the design of HP-WAN transport protocols.</t>
    </section>

    <section anchor="definition-of-terms" title="Definition of Terms">
      <t>This document makes use of the following terms:</t>

      <t><list hangIndent="2" style="hanging">
          <t hangText="TOE:">Transport Offload Engine. A set of techniques
          that offload part of the TCP/IP protocol stack or TCP processing
          to network adapters.</t>

          <t hangText="RDMA:">Remote Direct Memory Access. A technology
          that bypasses the CPU to access the memory of a remote network
          endpoint directly.</t>

          <t hangText="RoCEv2:">RDMA over Converged Ethernet version 2. The
          second version of RoCE, which carries the RDMA transport layer
          originating from InfiniBand over the UDP/IP stack.</t>

          <t hangText="iWARP:">Internet Wide Area RDMA Protocol. A protocol
          suite that comprises several layers and protocols to realize RDMA
          functionality over the TCP/IP stack.</t>
        </list></t>

      <t>Even though this document is not a protocol specification, it makes
      use of upper case key words to define requirements unambiguously. The
      key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
      "OPTIONAL" in this document are to be interpreted as described in BCP 14
      <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when,
      they appear in all capitals, as shown here.</t>
    </section>

    <section title="Gap Analysis of Existing Transport Solutions for HP-WAN">
      <t>There are three main ways to implement the TCP/IP protocol stack:
      running in user space, running in kernel space, and offloading-based
      solutions. Offloading-based methods can be further divided into
      partial offloading to assist the kernel and RDMA. Both running fully
      in kernel space and running in user space lead to very high CPU
      overhead and suffer from throughput performance bottlenecks. Partial
      offloading methods, such as TOE, can reduce CPU overhead to some
      extent, but still face throughput performance bottlenecks. Existing
      RDMA implementations either require support from intermediate network
      nodes or have high offloading complexity, and still cannot meet the
      performance requirements.</t>

      <section title="TCP/IP Stack Running in Kernel Space">
        <t>TCP and other mainstream transport protocols primarily run in
        the kernel space of operating systems such as Linux, and QUIC,
        although usually implemented in user space, relies on the kernel
        UDP stack. After long-term community version iterations and
        maintenance, these protocols have excellent reliability and
        scalability, such that they can meet the performance requirements
        of most Internet applications, such as multimedia and web
        applications. However, the core requirement of HP-WAN services is
        extremely high data throughput. Although the throughput of these
        kernel-based protocols has improved with the maturation of
        Bottleneck Bandwidth and Round-trip propagation time (BBR) <xref
        target="I-D.cardwell-ccwg-bbr"/>, the CPU overhead is still very
        high.</t>

        <t>The main reasons why these kernel-based transport protocols
        consume high CPU resources are:</t>

        <t><list style="symbols">
            <t>Frequent copies between kernel space and user space,
            especially on the receiver side, lead to excessive CPU resource
            consumption.</t>

            <t>Interrupt handling occupies many CPU cores, and interrupt
            processing contends with the processing of the TCP/IP protocol
            stack.</t>

            <t>Multi-stream transmission can improve the total throughput
            to a certain extent, but in general each stream needs to be
            bound to a CPU core, and multiple streams compete for CPU
            resources, resulting in CPU load imbalance that affects the
            actual throughput.</t>

            <t>In modern non-uniform memory access (NUMA) architectures,
            communication between CPU cores incurs significant
            overhead.</t>

            <t>Processing control messages consumes significant CPU
            resources.</t>
          </list></t>

        <t>Implementing the complete TCP/IP transport protocol stack in the
        kernel space of an operating system inevitably introduces the above
        problems. Even though these problems are outside the scope of IETF
        specifications, they affect the choice of technologies and the
        protocol design, especially when throughput performance metrics are
        critical, and they should therefore be considered in the transport
        protocol design space.</t>
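        <t>As an illustration of the copy overhead noted in the first
        bullet above, the following minimal C sketch shows one existing
        sender-side mitigation: the Linux MSG_ZEROCOPY mechanism, which
        lets the kernel transmit directly from pinned user pages. The
        send_zerocopy helper is hypothetical, while SO_ZEROCOPY and
        MSG_ZEROCOPY are existing Linux socket options; error handling is
        simplified, and this does not remove the receive-side copy that
        dominates HP-WAN workloads.</t>

        <figure>
          <artwork><![CDATA[
#include <stddef.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60           /* available since Linux 4.14 */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Send one buffer without copying it into kernel space.  The kernel
 * pins the user pages and reports completion on the socket error
 * queue, so the buffer must stay unmodified until that completion. */
int send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;               /* kernel too old: fall back to copy */
    if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
        return -1;
    return 0;
}
]]></artwork>
        </figure>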
      </section>

      <section title="TCP/IP Stack Running in User Space">
        <t>The advantage of running the complete TCP/IP protocol stack in
        user space is that it can effectively reduce the memory copy
        overhead between user space and kernel space, and even achieve zero
        copy, which reduces processing latency. However, it still
        introduces CPU overhead. The main reasons are the inability of
        threads and processes to share sockets and the overhead of
        multi-threaded locks. For example, when the main process listens
        for new connections on a socket, the system needs to start a child
        process to handle each request, and the child process needs to
        access the newly connected socket, while at the same time the
        parent process keeps listening on the socket for new connections.
        Socket contention between parent and child processes and the lock
        maintenance in multi-threaded applications are both detrimental to
        throughput performance, as the sketch below illustrates.</t>
      </section>

      <section title="Offloading TCP/IP Stack to Network Adapters">
        <section title="Transport Offload Engine (TOE)">
          <t>TOE is a partial offloading technology that offloads part of
          the TCP/IP stack to the hardware of the network adapter, while
          some of the protocol stack still runs in the kernel space of the
          operating system. This technique can reduce CPU overhead to some
          extent. Many mature TOE technologies have been adopted by the
          industry, such as generic segmentation offload (GSO), large
          receive offload (LRO), receive side scaling (RSS), and checksum
          offloading. However, TOE technology has a performance bottleneck.
          For example, Appendix A lists the measured throughput of the TCP
          and QUIC protocols under different congestion control algorithms,
          such as BBR and CUBIC <xref target="RFC9438"/>, with TOE
          enabled.</t>
        </section>
        </section>

        <section title="RDMA">
          <t>RDMA is an important technology for realizing CPU bypass, and
          there are standardized technologies applied in data center
          networks and wide area networks, such as RoCEv2 and iWARP, with a
          history of about 20 years. However, current RDMA technology has
          some limitations. RoCEv2 is oriented to data center networks and
          requires lossless behavior from the network; RoCEv2 is not
          defined by the IETF. iWARP is designed for data center networks,
          local area networks, and wide area networks. Because it runs over
          TCP, iWARP is reliable without intermediate network support,
          which is aligned with the design goal of an HP-WAN transport
          protocol, but iWARP still has performance limitations.</t>

          <section title="RoCEv2">
            <t>RoCEv2 is defined by the InfiniBand Trade Association
            (IBTA). It adapts the RDMA transport layer originally defined
            for InfiniBand to run over the UDP/IP protocol stack, and it is
            well aligned with the development trend of the Ethernet
            ecosystem. For RoCEv2 to guarantee high-throughput data
            transmission, the network needs to provide congestion
            notification mechanisms such as explicit congestion
            notification (ECN) and lossless transmission capabilities such
            as the priority-based flow control (PFC) mechanism of data
            center bridging (DCB). In short-distance data center networks,
            it is feasible to guarantee such capabilities, but in the
            Internet, providing such capabilities, especially lossless
            operation, is extremely expensive. Therefore, native RoCEv2 is
            not suitable for high-throughput data transmission in wide-area
            lossy environments.</t>
          </section>

          <section title="iWARP">
            <t>The iWARP protocol suite was defined in the IETF, and it is
            based on the TCP/IP stack, so it provides reliability by
            itself. It consists of three main layers and protocols to
            achieve RDMA functionality over data center networks, local
            area networks, and wide area networks. The first is the RDMA
            semantic layer, where the RDMA read, write, and send semantics
            are defined; RDMAP <xref target="RFC5040">RFC 5040</xref> is in
            this layer. The second is the RDMA core functionality layer,
            where message segmentation and reassembly, buffer models, and
            direct data placement with in-order message delivery are
            defined; RDDP <xref target="RFC5041">RFC 5041</xref> is in this
            layer. The third is the boundary marking layer, where data
            framing and integrity and Framed Protocol Data Unit (FPDU)
            alignment are defined; MPA <xref target="RFC5044">RFC
            5044</xref> is in this layer.</t>

            <t>Three offloading modes of iWARP are defined in <xref
            target="SC07"/>: host-based, host-offloaded, and host-assisted.
            The host-based mode offloads only the TCP stack to the network
            adapter. The host-offloaded mode offloads the RDDP, MPA, and
            TCP stacks to the network adapter. The host-assisted mode
            retains RDMAP and the Markers function of the MPA layer in the
            host; only RDDP, CRC, and the TCP stack are offloaded. There is
            an obvious performance gap among the three modes. More details
            on the measured performance are given in Appendix A. The
            conclusion is that none of the three implementations can
            guarantee high throughput, and the bandwidth of the network
            adapter cannot be fully utilized. This is perhaps part of the
            reason why iWARP has limited industry support and few
            implementations.</t>

            <figure align="center" title="Three Offloading Modes of iWARP">
              <artwork type="ascii-art"> +----------------+ +----------------+ +------------------+  
 | +-----++----+  | |    +-----+     | | +-----++-------+ |  
 | |RDMAP||RDDP|  | |    |RDMAP|     | | |RDMAP||Markers| |  
 | +-----++----+  | |    +-----+     | | +-----++-------+ |  
 | +---++-------+ | +-------+--------+ +--------+---------+  
 | |CRC||Markers| |         |                   |            
 | +---++-------+ |         |                   |        HOST
 +-------+--------+         |                   |            
---------+------------------+-------------------+------------
         |                  |                   |            
 +-------v--------+ +-------v--------+   +------v-------+    
 |    +-------+   | |+----++-------+ |   | +----+ +---+ |    
 |    |TCP/IP |   | ||RDDP||Markers| |   | |RDDP| |CRC| |    
 |    +-------+   | |+----++-------+ |   | +----+ +---+ | NIC
 +----------------+ |+---+ +-------+ |   |   +------+   |    
                    ||CRC| |TCP/IP | |   |   |TCP/IP|   |    
                    |+---+ +-------+ |   |   +------+   |    
                    +----------------+   +--------------+    </artwork>
            </figure>
          </section>
        </section>
      </section>
    </section>

    <section title="Requirements for New HP-WAN Mechanisms">
      <t>Based on the analysis above, existing transport solutions
      introduce high CPU cost or high offloading complexity, which results
      in performance bottlenecks. These solutions cannot evolve to satisfy
      HP-WAN high-throughput performance requirements in the Tbit-level
      Ethernet era. Therefore, the following new requirements are
      proposed.</t>

      <section title="Support RDMA">
        <t>For applications with ultra-high throughput transmission
        requirements, RDMA MUST be supported to reduce CPU overhead. It is
        RECOMMENDED that RDMA establish a connection with a single
        handshake and tear down the connection with a single handshake at
        the semantic level. The write operation MUST be supported, and the
        read operation MAY be supported.</t>
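        <t>As a point of reference, the following C sketch posts a
        one-sided RDMA WRITE through the existing Verbs interface. It is
        not a specification of the new protocol; it only shows the write
        semantics that MUST be supported. The post_rdma_write helper is
        hypothetical, while ibv_post_send and the ibv_* structures are
        existing Verbs constructs; the queue pair, memory registration,
        remote address, and rkey are assumed to have been exchanged during
        connection setup, and error handling is omitted.</t>

        <figure>
          <artwork><![CDATA[
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a one-sided RDMA WRITE: the local buffer is placed directly
 * into remote memory without involving the remote CPU. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode              = IBV_WR_RDMA_WRITE,
        .sg_list             = &sge,
        .num_sge             = 1,
        .send_flags          = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad_wr = NULL;

    return ibv_post_send(qp, &wr, &bad_wr);
}
]]></artwork>
        </figure>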
      </section>

      <section title="Lightweight Transport Layer">
        <t>The iWARP protocol suite can fully offload the TCP/IP transport
        layer, but due to its space and time complexity, its performance
        still has a bottleneck, especially in throughput. Therefore, RDMA
        cannot be based on a complete TCP transport protocol
        implementation, and a lightweight transport protocol must be
        designed. On the other hand, the transport protocol should still
        provide reliability and congestion management functions so that it
        does not depend on any auxiliary capabilities provided by the
        network layer.</t>

        <section title="Congestion Control">
          <t>In terms of congestion control, the mechanism for congestion
          detection is a key factor in the algorithm. Delay, bandwidth, and
          packet loss are all important signals for determining congestion,
          but they have their own advantages and disadvantages in terms of
          false positive rate, fairness, and algorithm complexity.
          Therefore, when designing congestion control mechanisms, the
          following requirements can be considered:</t>

          <t><list style="symbols">
              <t>TCP's slow start and congestion avoidance mechanisms MUST
              be improved to enhance bandwidth utilization.</t>

              <t>It is RECOMMENDED to transform the congestion window
              mechanism into a time-interval-based sending rate control
              mechanism, which is easier to implement on a network adapter,
              as sketched below.</t>
            </list></t>
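          <t>The following C sketch illustrates time-interval-based rate
          control: instead of a congestion window, the sender derives a
          per-packet pacing gap from a target rate, which maps naturally
          onto a hardware pacing timer. The function and parameter names
          are illustrative assumptions, not part of any specification.</t>

          <figure>
            <artwork><![CDATA[
#include <stdint.h>

/* Time between packet transmissions, in nanoseconds, for a given
 * target rate: time per packet = bits per packet / bits per second. */
static inline uint64_t pacing_interval_ns(uint64_t target_rate_bps,
                                          uint32_t pkt_len_bytes)
{
    return (uint64_t)pkt_len_bytes * 8u * 1000000000ull
           / target_rate_bps;
}

/* Example: 1500-byte packets at 100 Gbit/s give a 120 ns gap.  The
 * congestion controller only updates target_rate_bps; the network
 * adapter enforces the gap without per-packet CPU involvement. */
]]></artwork>
          </figure>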
        </section>

        <section title="Reliability">
          <t>TCP's rate control relies on the sliding window mechanism,
          which ensures reliability but also limits the maximum sending
          capability of the sender to some extent. TCP's ability to handle
          out-of-order reception SHOULD be retained, so that data packets
          are stored as soon as they arrive, reducing resource occupancy.
          For handling lost packets, a precise retransmission mechanism
          MUST be designed, with retransmission notifications for lost
          packets only and support for cumulative acknowledgement, reducing
          the impact of packet loss and transmission distance on throughput
          performance.</t>
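          <t>A minimal C sketch of such a receiver follows: packets are
          stored as they arrive, a cumulative acknowledgement point
          advances over the contiguous prefix, and any gap below the
          highest sequence number seen identifies exactly which packets to
          request again. The structure and names are illustrative
          assumptions.</t>

          <figure>
            <artwork><![CDATA[
#include <stdbool.h>
#include <stdint.h>

#define RX_WINDOW 1024  /* receive window, in packets (assumption) */

struct rx_state {
    uint64_t cum_ack;          /* all packets < cum_ack received  */
    bool     got[RX_WINDOW];   /* bitmap for the window above it  */
};

/* Accept a packet in any order, then advance the cumulative ack
 * over the contiguous prefix.  Sequence numbers still missing
 * below the highest one seen form the precise retransmission
 * (NACK) set. */
void on_packet(struct rx_state *s, uint64_t seq)
{
    if (seq < s->cum_ack || seq >= s->cum_ack + RX_WINDOW)
        return;                       /* duplicate or out of window */
    s->got[seq % RX_WINDOW] = true;
    while (s->got[s->cum_ack % RX_WINDOW]) {
        s->got[s->cum_ack % RX_WINDOW] = false;
        s->cum_ack++;                 /* cumulative acknowledgement */
    }
}
]]></artwork>
          </figure>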
        </section>

        <section title="Multi-path Transmission">
          <t>Multi-path transmission SHOULD be supported. Within a single
          connection, multiple streams are maintained in parallel, and the
          streams are independent: congestion control parameters can be set
          separately for each stream, as sketched below.</t>
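          <t>The following C sketch shows one possible per-connection
          layout in which each stream carries its own congestion controller
          state, so loss or delay on one stream does not throttle the
          others. All names are illustrative assumptions.</t>

          <figure>
            <artwork><![CDATA[
#include <stdint.h>

#define MAX_STREAMS 16  /* per-connection stream limit (assumption) */

struct stream_cc {              /* independent per-stream state */
    uint64_t rate_bps;          /* current pacing rate           */
    uint64_t rtt_ns;            /* latest RTT sample             */
    uint32_t loss_events;       /* per-stream loss accounting    */
};

struct connection {
    uint32_t         n_streams;
    struct stream_cc cc[MAX_STREAMS];
};
]]></artwork>
          </figure>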
        </section>

        <section title="Other Requirements">
          <t>In addition, security MUST be considered in most cases.
          Packets SHOULD be designed with a high payload ratio, fully
          considering the complexity of offloading. At the same time, the
          above components SHOULD be designed modularly; that is,
          congestion control, lost packet retransmission, and data
          encryption can be turned on and off on demand.</t>
        </section>
      </section>

      <section title="Application Developer Friendly Interfaces">
        <t>The new mechanism MUST NOT change the programming interfaces of
        mainstream applications, such as Verbs, Libfabric, and Socket. At
        present, RDMA mainly supports the Verbs and Libfabric interfaces,
        and big data transmission applications need to develop adaptations
        for these interfaces. If distributed applications that require huge
        amounts of data movement are based on the Socket interface in the
        future, the development of the network adapter driver MUST be
        oriented to the Socket interface. However, the RDMA and Socket
        abstractions do not match, API compatibility is poor, and it is
        necessary to support a standardized programming interface, as the
        sketch below illustrates.</t>
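        <t>The following C sketch outlines why a Socket-compatible shim
        over RDMA is non-trivial: a Socket-style send() accepts any user
        buffer at any time, while Verbs-style RDMA requires buffers to be
        registered with the network adapter first. The hpwan_send name and
        the steps in the comments are hypothetical.</t>

        <figure>
          <artwork><![CDATA[
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical Socket-compatible entry point over an RDMA engine. */
ssize_t hpwan_send(int fd, const void *buf, size_t len)
{
    /* 1. Look up the RDMA queue pair bound to this fd.             */
    /* 2. Ensure [buf, buf + len) is covered by a registered memory */
    /*    region, registering it on the fly or copying into a       */
    /*    pre-registered bounce buffer (which reintroduces a copy). */
    /* 3. Post an RDMA operation and wait for, or batch, the        */
    /*    completion so that Socket "data accepted" semantics hold. */
    (void)fd; (void)buf;
    return (ssize_t)len;            /* placeholder result */
}
]]></artwork>
        </figure>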
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>

    <section title="Acknowledgements">
      <t>The authors would like to thank other team members from China
      Mobile for their contributions: Guangyu Zhao, Shiping Xu, Zongpeng
      Du, and Zhiqiang Li.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.5040"?>

      <?rfc include="reference.RFC.5041"?>

      <?rfc include="reference.RFC.5044"?>

      <?rfc include="reference.RFC.8174"?>

      <?rfc include="reference.RFC.9438"?>
    </references>

    <references title="Informative References">
      <?rfc include="reference.I-D.cardwell-ccwg-bbr"?>

      <reference anchor="SC07">
        <front>
          <title>Analyzing the Impact of Supporting Out-of-Order
          Communication on In-Order Performance with iWARP</title>

          <author fullname="Pavan Balaji" initials="P." surname="Balaji"/>

          <date year="2007"/>
        </front>

        <seriesInfo name="SC '07"
                    value="Proceedings of the 2007 ACM/IEEE Conference on Supercomputing"/>
      </reference>
    </references>

    <section title="Performance Test Results">
      <section title="TCP Performance Test with TOE Enabled">
        <t>TCP was tested in the following environment: CPU: Intel Xeon
        Gold 6248R, 4x24=96 cores, 3.0 GHz clock frequency; network
        adapter: 100 Gbps NVIDIA ConnectX-5, PCIe 3.0 x16; operating
        system: CentOS 8.</t>

        <t>The experiment was performed in a laboratory environment, and
        single-stream and multi-stream throughput were tested under
        different congestion control algorithms. BBRv1 and CUBIC were
        tested under 0.1% and 1% packet loss conditions. The throughput of
        the BBR algorithm can reach more than 10 Gbps with a single stream
        and peaks at more than 80 Gbps with 25 streams sharing the network.
        The throughput of BBR is much higher than that of CUBIC, because
        CUBIC is very sensitive to packet loss. Nevertheless, the
        performance of BBR still cannot reach the bandwidth ceiling.</t>

        <t>The reasons are that, on the one hand, the receiver must bear
        the additional overhead introduced by packet loss detection and
        packet loss recovery; on the other hand, TOE's optimization effect
        degrades. These effects cause the CPU utilization to approach its
        bottleneck.</t>

        <figure align="center"
                title="TCP Throughput Performance, RTT=70ms, MTU=1500">
          <artwork type="ascii-art">+---------+------------+----------+------------+----------+
|         | TCP+BBRv1  |TCP+BBRv1 | TCP+CUBIC  |TCP+CUBIC | 
|         |0.1%Pkt loss|1%Pkt loss|0.1%Pkt loss|1%Pkt loss|
+---------+------------+----------+------------+----------+
| Single  |   14Gbps   |  10Gbps  |   8.6Mbps  |   Null   |
| Stream  |            |          |            |          |
+---------+------------+----------+------------+----------+
|    3    |   41Gbps   | 24.5Gbps |   28Mbps   |   Null   |
| Streams |            |          |            |          |
+---------+------------+----------+------------+----------+ 
|   10    |   70Gbps   |  61Gbps  |   91Mbps   |   Null   |  
| Streams |            |          |            |          |     
+---------+------------+----------+------------+----------+ 
|   25    |   84Gbps   | 84.7Gbps |    Null    |   Null   |
| Streams |            |          |            |          |   
+---------+------------+----------+------------+----------+</artwork>
        </figure>
      </section>

      <section title="QUIC Performance Test with TOE Enabled">
        <t>The QUIC performance test was performed in the following
        environment: CPU: Intel Xeon Gold 6248 @ 2.5 GHz, 20 cores each;
        network adapter: NVIDIA ConnectX-5, dual-port 100G, PCIe 3.0 x16.
        The test ran over the China Mobile backbone network (CMNET) from
        Harbin to Guiyang, a distance of over 3000 kilometers with an
        average RTT of around 65 ms. The average packet loss rate is
        between 0.01% and 0.1%.</t>

        <t>BBRv1 was selected as the congestion control algorithm, and the
        test results were obtained with CPU utilization approaching
        saturation. In the 40-core test case, when the number of streams
        exceeds 40, the throughput stays at about 50 Gbps. In the 80-core
        test case, when the number of streams exceeds 80, the throughput
        stays at about 63 Gbps. Under both test conditions, there is a
        large gap between the theoretical 100 Gbps bandwidth limit of the
        network adapter and the real data transmission throughput. The
        results confirm that CPU overhead plays a pivotal role in endpoint
        data transmission throughput.</t>

        <figure align="center"
                title="QUIC Throughput Performance, RTT=65ms, MTU=1500">
          <artwork type="ascii-art">+---------+------------+------------+
|         | QUIC+BBRv1 | QUIC+BBRv1 |
|         |0.1%Pkt loss|0.1%Pkt loss|
|         |  40 cores  |  80 cores  |
+---------+------------+------------+
|   40    |   47.2Gbps | 52.8Gbps   |
| Streams |            |            |
+---------+------------+------------+
|   60    |   42.4Gbps | 57.2Gbps   |
| Streams |            |            |
+---------+------------+------------+
|   80    |   51.2Gbps | 62.4Gbps   |
| Streams |            |            |
+---------+------------+------------+
|   100   |   NULL     | 63.2Gbps   |
| Streams |            |            |
+---------+------------+------------+</artwork>
        </figure>
      </section>

      <section title="iWARP Performance (Referenced from SC07 Paper)">
        <t>Test environment: Chelsio 10GE NIC, two Intel Xeon 3.0 GHz
        processors, Red Hat OS.</t>

        <t>The figure shows that the three offloading modes differ greatly
        in actual network adapter throughput and host CPU utilization. When
        processing small 128-byte messages, the CPU utilization of the
        three modes is relatively low, but so is the throughput. When
        processing larger 256 KB messages, although the CPU utilization of
        the host-offloaded mode is very low and stays around 10%, its
        throughput is only 3.5 Gbps, about 1/3 of the network adapter
        bandwidth, so its throughput performance is not good. The best
        throughput is achieved in the host-assisted mode, which is
        partially offloaded, but even that is only about 60% of the network
        adapter bandwidth, with CPU utilization reaching 80%. The
        host-based mode performs worse in terms of both bandwidth and CPU
        utilization.</t>

        <t>The above results show that there are performance bottlenecks in
        all three implementations of iWARP; none can maintain throughput
        close to the physical bandwidth of the network adapter with low CPU
        consumption. One of the main influencing factors is the Markers
        function in the MPA layer. Markers play an important role in
        marking the record boundaries between the TCP and RDDP protocols.
        Offloading Markers to network adapters requires too many state
        maintenance resources, which affects the actual throughput, while
        implementing the Markers function inside the host makes the CPU
        usage too high. Therefore, although TCP-based iWARP has good
        reliability and scalability, its performance is still limited and
        needs further improvement.</t>

        <figure align="center"
                title="Performance Comparison of Three Offloading Modes of iWARP">
          <artwork type="ascii-art">+---------+----------+----------+----------+
|         | MSG size | MSG size | MSG size |
|         |   128B   |   1KB    |  256KB   |
|         +-----+----+-----+----+-----+----+
|         | BW  |CPU | BW  |CPU | BW  |CPU |
|         |Gbps |Util|Gbps |Util|Gbps |Util|
+---------+-----+----+-----+----+-----+----+
|  host-  | 0.1 |12% | 0.8 |40% | 1.8 |75% |
|  based  |     |    |     |    |     |    |
+---------+-----+----+-----+----+-----+----+
|  host-  | 1.0 |10% | 3.5 |10% | 3.5 |10% |
|offloaded|     |    |     |    |     |    |
+---------+-----+----+-----+----+-----+----+
|  host-  | 0.5 |18% | 3.5 |50% | 5.8 |80% |
| assisted|     |    |     |    |     |    |
+---------+-----+----+-----+----+-----+----+</artwork>
        </figure>
        </figure>
      </section>
    </section>
  </back>
</rfc>
