<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="std" docName="draft-li-adaptive-load-balance-enhancement-00" ipr="trust200902">
  <front>
    <title abbrev="6man Working Group">Adaptive Load Balance Enhancement</title>

    <author fullname="Zhiqiang Li" initials="Z." surname="Li">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>lizhiqiangyjy@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Zongpeng Du" initials="Z." surname="Du">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>duzongpeng@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Wei Cheng" initials="W." surname="Cheng">
      <organization>Centec</organization>

      <address>
        <postal>
          <street/>

          <city>Suzhou</city>

          <code>215000</code>

          <country>China</country>
        </postal>

        <email>chengw@centec.com</email>
      </address>
    </author>

    <author fullname="Junjie Wang" initials="J." surname="Wang">
      <organization>Centec</organization>

      <address>
        <postal>
          <street/>

          <city>Suzhou</city>

          <code>215000</code>

          <country>China</country>
        </postal>

        <email>wangjj@centec.com</email>
      </address>
    </author>


    <date day="03" month="March" year="2025"/>

    <area>Networking</area>

    <workgroup>Network Working Group</workgroup>

    <keyword>Quic;NAT</keyword>

    <abstract>
      <t>This draft proposes an adaptive load-balancing mechanism to address 
       high-throughput transmission challenges in East-West computing, data 
       exchange, and DC interconnection services. Traditional methods (ECMP, 
       RPS, and Flowlet) each have limitations: ECMP suffers from hash collisions 
       and load imbalance; RPS risks TCP packet reordering; Flowlet depends 
       on manually configured burst-interval thresholds that are impractical 
       to tune, leading to suboptimal performance.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in RFC 2119.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>East Data West Computing, Data Express, and DC Interconnection 
      services require high-throughput network transmission. The mainstream 
      high-throughput interconnection and load-balancing mechanisms in the 
      industry today are Equal-Cost Multi-Path (ECMP), Random Packet 
      Spraying (RPS), and Flowlet switching.</t>

      <t>ECMP balances load at flow granularity. When a packet arrives at 
      the first switch from the host, the switch uses the five-tuple 
      (source IP, destination IP, source port, destination port, protocol) 
      to determine whether it is the first packet of a flow. If so, a hash 
      algorithm selects a forwarding port; all subsequent packets of the 
      flow follow the same port. Because a flow stays on one path, ECMP 
      prevents TCP packet reordering and the retransmissions it triggers. 
      However, a poor hash design may cause collisions, leading to load 
      imbalance and increased queuing delay. Even with an ideal hash, 
      short flows may be assigned to ports that already carry long flows, 
      which increases their delay.</t>
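
      <t>As an illustration only, the following minimal sketch shows 
      five-tuple hashing and how several flows can land on the same port. 
      The function names, the sample addresses, and the use of SHA-256 
      are assumptions made for this example; real switches typically use 
      a hardware hash function.</t>

      <figure>
        <artwork><![CDATA[
```python
import hashlib
import ipaddress
import struct
from collections import Counter


def ecmp_port(src_ip, dst_ip, src_port, dst_port, proto, num_ports):
    """Map a five-tuple to one of num_ports equal-cost ports."""
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!HHB", src_port, dst_port, proto)
    )
    # SHA-256 is only a stand-in for the switch's hardware hash.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_ports


def port_load(flows, num_ports):
    """Count how many flows land on each port."""
    return Counter(ecmp_port(*flow, num_ports) for flow in flows)


if __name__ == "__main__":
    # Eight hypothetical TCP flows (protocol 6) over four ports.
    flows = [("10.0.0.1", "10.0.1.1", 40000 + i, 443, 6)
             for i in range(8)]
    print(sorted(port_load(flows, 4).items()))
```
]]></artwork>
      </figure>

      <t>Every packet of a flow maps to the same port, so ordering is 
      preserved. Whenever there are more flows than ports, at least two 
      flows share a port regardless of the hash quality, and a port that 
      carries two elephant flows becomes a hot spot.</t>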

      <t>RPS balances load at packet granularity. Each packet 
      arriving at the switch is randomly assigned a forwarding port, 
      regardless of the flow it belongs to. Under high traffic, RPS 
      achieves near-perfect load balancing and minimizes queuing delay. 
      However, packets of one flow may then travel over paths with 
      different delays, so random port assignment risks TCP packet 
      reordering and the retransmissions it causes. Severe imbalance 
      between paths, while rare, amplifies this problem.</t>
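
      <t>The reordering risk can be illustrated with a small simulation. 
      This is a sketch under simplifying assumptions (fixed per-path 
      delay, constant sending interval, no queuing); the function names 
      and the delay values are hypothetical.</t>

      <figure>
        <artwork><![CDATA[
```python
import random


def spray(num_packets, path_delays_us, send_gap_us, rng):
    """Return sequence numbers in arrival order under RPS."""
    arrivals = []
    for seq in range(num_packets):
        # Each packet picks a path at random, ignoring its flow.
        path = rng.randrange(len(path_delays_us))
        arrival_us = seq * send_gap_us + path_delays_us[path]
        arrivals.append((arrival_us, seq))
    arrivals.sort()
    return [seq for _, seq in arrivals]


def count_reordered(arrival_order):
    """Count packets that arrive after a higher sequence number."""
    highest = -1
    late = 0
    for seq in arrival_order:
        if seq < highest:
            late += 1
        else:
            highest = seq
    return late


if __name__ == "__main__":
    rng = random.Random(1)
    equal = spray(1000, [20, 20, 20, 20], 1, rng)
    skewed = spray(1000, [20, 25, 60, 90], 1, rng)
    print("equal delays :", count_reordered(equal))
    print("skewed delays:", count_reordered(skewed))
```
]]></artwork>
      </figure>

      <t>With equal path delays the arrival order matches the sending 
      order. Once the delay difference between paths exceeds the sending 
      interval, a packet on a fast path can overtake an earlier packet 
      on a slow path.</t>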

      <t>Flowlet switching balances load at the granularity of flow 
      segments (bursts of packets that TCP sends together). It exploits 
      the time gap between TCP bursts: if the interval between two bursts 
      exceeds the transmission delay threshold, the new flowlet is 
      randomly assigned a port; otherwise, it follows the previous port. 
      Flowlet thus combines ECMP's ordering stability with RPS's load 
      balancing. Its drawback is that the threshold, which must cover the 
      transmission delay difference between paths, is set manually. An 
      overly large threshold makes Flowlet behave like ECMP, while an 
      overly small one makes it behave like RPS.</t>
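
      <t>The gap test can be sketched as follows. This is a minimal 
      example, not a description of any particular implementation; the 
      class name, its fields, and the microsecond time unit are 
      assumptions.</t>

      <figure>
        <artwork><![CDATA[
```python
import random


class FlowletTable:
    """Per-flow state for gap-based flowlet switching."""

    def __init__(self, num_ports, gap_threshold_us, rng=None):
        self.num_ports = num_ports
        self.gap_threshold_us = gap_threshold_us
        self.rng = rng or random.Random()
        self.flowlets_started = 0
        self._state = {}  # flow key -> (last_seen_us, port)

    def select_port(self, flow_key, now_us):
        """Return the egress port for a packet seen at now_us."""
        entry = self._state.get(flow_key)
        if entry and now_us - entry[0] <= self.gap_threshold_us:
            port = entry[1]  # same flowlet: keep the previous port
        else:
            # Gap exceeded (or new flow): start a new flowlet.
            port = self.rng.randrange(self.num_ports)
            self.flowlets_started += 1
        self._state[flow_key] = (now_us, port)
        return port


if __name__ == "__main__":
    table = FlowletTable(num_ports=4, gap_threshold_us=100)
    for now_us in (0, 50, 120, 300, 320):
        print(now_us, table.select_port("flow-1", now_us))
    print("flowlets:", table.flowlets_started)
```
]]></artwork>
      </figure>

      <t>The whole behavior hinges on gap_threshold_us. With a very 
      large value the flow never leaves its first port, as in ECMP; with 
      a value below the packet spacing every packet is re-assigned, as 
      in RPS.</t>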
    </section>

    <section title="Problem Statement">
      <t>Throughput-sensitive services often involve elephant flows 
      (long-lived, high-volume flows). Traditional load-balancing 
      mechanisms such as ECMP, which hash on the five-tuple, fail to 
      fully utilize link bandwidth because each flow is pinned to a 
      single path. Packet-level load balancing (RPS) introduces 
      out-of-order delivery, and as bandwidth increases, the CPU and 
      memory overhead of reordering packets at the receiver becomes 
      prohibitive or even unmanageable. Flowlet-based load balancing is 
      a compromise between ECMP and RPS, but determining its 
      transmission time difference threshold is impractical because it 
      requires manual configuration: an excessively large threshold 
      degrades Flowlet into ECMP, while an overly small one degrades it 
      into RPS. Meanwhile, the industry lacks a standardized solution 
      for increasingly critical high-throughput transmission demands.</t>
    </section>

    <section title="Solution">
      <t>Edge Ingress Node: The entry point for high-throughput service flows. 
      Based on the service type (identified, e.g., by DPI or directly from 
      the service ID) and the link conditions to the destination node 
      (obtained via controller requests or distributed in-band detection), 
      it implements multi-path parallel transmission. The specific steps are:
        <list style="numbers">
          <t>Query all available paths to the destination and select the 
          path with the lowest bandwidth utilization and lowest latency 
          to send the first packet (or flowlet).</t>

          <t>For the second packet (or flowlet), choose the path with the 
          lowest bandwidth utilization among those that cannot cause 
          out-of-order delivery once transmission delays are taken into 
          account.</t>

          <t>Subsequent packets follow the same logic, dynamically 
          selecting paths to balance bandwidth efficiency and sequence 
          integrity.</t>

          <t>When selecting the transmission unit (single packet, 
          flowlet, or subflow), ensure a sufficient time margin to avoid 
          out-of-order delivery. For example, use shorter subflows on 
          high-latency paths and longer subflows on low-latency paths.</t>
        </list>
      </t>
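
      <t>The ingress steps above can be sketched as follows. This is an 
      illustrative model only: it assumes the ingress node knows each 
      path's one-way latency and current utilization, ignores 
      serialization and queuing delay, and uses a hypothetical unit_load 
      parameter to model how much one transmission unit adds to a path's 
      utilization. All names are assumptions made for this example.</t>

      <figure>
        <artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    latency_us: int     # one-way delay, assumed known at ingress
    utilization: float  # fraction of bandwidth in use, 0.0 to 1.0


def pick_path(paths, send_time_us, last_arrival_us):
    """Pick the least-utilized path that cannot cause reordering."""
    feasible = [p for p in paths
                if send_time_us + p.latency_us >= last_arrival_us]
    if not feasible:
        # Fall back to the slowest path, which is the least likely
        # to overtake earlier units.
        feasible = [max(paths, key=lambda p: p.latency_us)]
    return min(feasible, key=lambda p: (p.utilization, p.latency_us))


def schedule(paths, send_times_us, unit_load=0.0):
    """Assign each transmission unit to a path, in send order."""
    last_arrival = float("-inf")  # nothing sent: all paths feasible
    plan = []
    for t in send_times_us:
        path = pick_path(paths, t, last_arrival)
        arrival = t + path.latency_us
        path.utilization += unit_load  # hypothetical load model
        last_arrival = max(last_arrival, arrival)
        plan.append((t, path.name, arrival))
    return plan


if __name__ == "__main__":
    paths = [Path("A", 100, 0.20), Path("B", 40, 0.10),
             Path("C", 150, 0.50)]
    for row in schedule(paths, [0, 10, 20], unit_load=0.15):
        print(row)
```
]]></artwork>
      </figure>

      <t>In this run the first unit takes path B, the least utilized 
      one. The third unit, sent at 20 us, cannot use path B because it 
      would arrive at 60 us, before the second unit arrives at 110 us 
      on path A, so it stays on path A.</t>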

      <t>Intermediate Nodes: For high-throughput flows, these nodes 
      perform hierarchical load balancing if instructed by packet headers; 
      otherwise, they act as normal nodes for transparent forwarding.</t>

      <t>Edge Egress Node: For strictly ordered services, it performs 
      final in-order verification. In the rare case that packets arrive 
      out of order, this node buffers and reorders them to ensure 
      sequential delivery to the receiver.</t>
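
      <t>A minimal sketch of such a reorder buffer is shown below. It 
      assumes that every packet carries a per-flow sequence number 
      starting at a known value; this field and the class name are 
      assumptions made for this example and are not defined by this 
      document.</t>

      <figure>
        <artwork><![CDATA[
```python
class ReorderBuffer:
    """Hold early packets until the missing ones arrive."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq  # next sequence number to deliver
        self._pending = {}         # seq -> payload, arrived early

    def receive(self, seq, payload):
        """Accept one packet; return payloads now deliverable."""
        if seq < self.next_seq or seq in self._pending:
            return []  # duplicate or already delivered
        self._pending[seq] = payload
        ready = []
        while self.next_seq in self._pending:
            ready.append(self._pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


if __name__ == "__main__":
    buf = ReorderBuffer()
    for seq, data in [(0, "a"), (2, "c"), (1, "b"), (3, "d")]:
        print(seq, buf.receive(seq, data))
```
]]></artwork>
      </figure>

      <t>The sketch omits a timeout for packets that never arrive; a 
      deployable buffer would need one so that a single loss does not 
      stall delivery indefinitely.</t>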

      <t>Interactive packets can be carried over multiple data plane protocols. 
      Taking the IPv6 extension header as an example, the Next Header value 
      for the high-throughput in-order transmission extension header is 
      temporarily set to 100 (to be updated after formal IANA allocation). 
      The extension header length is 12 bytes, and its specific format 
      follows the definition provided earlier.</t>
  </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>
  <back/>

</rfc>
