<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
    There has to be one entity for each item to be referenced. 
    An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
    please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
    (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
    (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std" docName="draft-xu-rtgwg-fare-in-sun-01" ipr="trust200902">
  <front>
    <title abbrev="FARE in SUN">Fully Adaptive Routing Ethernet in Scale-Up
    Networks</title>

    <author fullname="Xiaohu Xu" initials="X." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <email>xuxiaohu_ietf@hotmail.com</email>
      </address>
    </author>

    <author fullname="Zongying He" initials="Z." surname="He">
      <organization>Broadcom</organization>

      <address>
        <email>zongying.he@broadcom.com</email>
      </address>
    </author>

    <author fullname="Hua Wang" initials="H." surname="Wang">
      <organization>Moore Threads</organization>

      <address>
        <email>wh@mthreads.com</email>
      </address>
    </author>

    <author fullname="Tianyou Zhou" initials="T." surname="Zhou">
      <organization>Resnics Technology</organization>

      <address>
        <email>tzhou@resnics.com</email>
      </address>
    </author>

    <author fullname="Yongtao Yang" initials="Y." surname="Yang">
      <organization>Centec</organization>

      <address>
        <email>yangyt@centec.com</email>
      </address>
    </author>

    <author fullname="Yinben Xia" initials="Y." surname="Xia">
      <organization>Tencent</organization>

      <address>
        <email>forestxia@tencent.com</email>
      </address>
    </author>

    <author fullname="Peilong Wang" initials="P." surname="Wang">
      <organization>Baidu</organization>

      <address>
        <email>wangpeilong01@baidu.com</email>
      </address>
    </author>

    <author fullname="Yan Zhuang" initials="Y." surname="Zhuang">
      <organization>Huawei Technologies</organization>

      <address>
        <email>zhuangyan.zhuang@huawei.com</email>
      </address>
    </author>

    <author fullname="Fajie Yang" initials="F." surname="Yang">
      <organization>Cloudnine Information Technologies</organization>

      <address>
        <email>yangfajie@cloudnineinfo.com</email>
      </address>
    </author>

    <author fullname="Chao Li" initials="C." surname="Li">
      <organization>Metanet Networking Technology</organization>

      <address>
        <email>lichao22@ieisystem.com</email>
      </address>
    </author>

    <author fullname="Wang Xiaojun" initials="X." surname="Wang">
      <organization>Ruijie Networks</organization>

      <address>
        <email>wxj@ruijie.com.cn</email>
      </address>
    </author>

    <date day="20" month="May" year="2025"/>

    <abstract>
      <t>The Mixture of Experts (MoE) has become a dominant paradigm in
      transformer-based artificial intelligence (AI) large language models
      (LLMs). It is widely adopted in both distributed training and
      distributed inference. Furthermore, the disaggregation of the prefill
      and decode phases is highly beneficial and is considered a best practice
      for distributed inference models; however, this approach depends on
      highly efficient Key-Value (KV) cache synchronization. To enable
      efficient expert parallelization and KV cache synchronization across
      dozens or even hundreds of Graphics Processing Units (GPUs) in MoE
      architectures, an ultra-high-throughput, ultra-low-latency AI scale-up
      network (SUN) that can efficiently distribute data across all network
      planes is critical. This document describes how to extend the Weighted
      Equal-Cost Multi-Path (WECMP) load-balancing mechanism, referred to as
      Fully Adaptive Routing Ethernet (FARE), which was originally designed
      for scale-out networks, to scale-up networks.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>The Mixture of Experts (MoE) has become a dominant paradigm in
      transformer-based artificial intelligence (AI) large language models
      (LLMs). It is widely adopted in both distributed training and
      distributed inference. Furthermore, the disaggregation of the prefill
      and decode phases is highly beneficial and is considered a best practice
      for distributed inference models; however, this approach depends on
      highly efficient Key-Value (KV) cache synchronization. </t>

      <t>To enable efficient expert parallelization and KV cache
      synchronization across dozens or even hundreds of Graphics Processing
      Units (GPUs) in MoE architectures, an ultra-high-throughput,
      ultra-low-latency AI scale-up network (SUN) is indispensable. This
      network serves as the interconnection fabric, allowing GPUs to function
      as a unified super GPU, referred to as a&nbsp;SuperPod. The scale-up
      network is fundamental for efficiently transporting substantial volumes
      of communication traffic within the SuperPod. It includes 1) all-to-all
      traffic for Expert Parallelism (EP) communication, enabling experts
      running on GPU servers to exchange information seamlessly, and 2)
      all-reduce traffic for Tensor Parallelism (TP) communication, ensuring
      consistent tensor values across GPUs during training and inference.</t>

      <figure>
        <artwork align="center"><![CDATA[      
   +----+ +----+ +----+ +----+  
   | L1 | | L2 | | L3 | | L4 |  (Leaf)
   +----+ +----+ +----+ +----+             
             
   +----+ +----+ +----+ +----+ +----+ +----+      +----+
   | G1 | | G2 | | G3 | | G4 | | G5 | | G6 | ...  |G64 |  (GPU)
   +----+ +----+ +----+ +----+ +----+ +----+      +----+ 


                              Figure 1]]></artwork>
      </figure>

      <t>(Note that the diagram above omits the connections between GPUs and
      leaf switches. In this one-tier scale-up network topology, each GPU can
      be assumed to connect to every leaf switch.)</t>

      <t>As shown in Figure 1, a 64-GPU SuperPod consists of 64 GPUs and four
      high-radix leaf switches (e.g., each with 128 400G QSFP112 ports). To
      achieve inter-GPU bandwidths of several terabits per second (Tbps) or
      higher, each GPU is typically equipped with multiple scale-up network
      ports (e.g., four 800 Gbps OSFP ports). Each port connects to a
      separate scale-up leaf switch via a Y-cable, forming four distinct
      network planes.</t>

      <t>In such multi-plane scale-up networks, achieving ultra-high bandwidth
      and ultra-low latency requires two key strategies. First, efficiently
      distributing data across all network planes is critical. For instance,
      if an 800G port on a GPU fails, traffic destined for that GPU over the
      faulty plane must immediately cease. If one 400G sub-cable of a given
      800G Y-cable malfunctions, halving the bandwidth of the affected plane,
      traffic on that plane between the relevant GPU pair should be
      proportionally reduced. Second, incast traffic patterns inherent to
      all-to-all communication may cause congestion on the egress ports of a
      last-hop switch; therefore, a more efficient congestion management
      mechanism is required.</t>
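
      <t>As a non-normative illustration of the proportional reduction
      described above, the following Python sketch derives per-plane traffic
      shares from each plane's available path bandwidth. All function and
      variable names are illustrative only and are not defined by this
      document:</t>

      <figure>
        <artwork><![CDATA[
```python
# Illustrative sketch: derive per-plane traffic shares from the
# available path bandwidth of each network plane.
def plane_weights(path_bw_gbps):
    """Map {plane: available bandwidth in Gbps} to traffic shares."""
    total = sum(path_bw_gbps.values())
    if total == 0:
        return {plane: 0.0 for plane in path_bw_gbps}
    return {plane: bw / total for plane, bw in path_bw_gbps.items()}

# Four healthy 800G planes: traffic splits evenly, 0.25 per plane.
healthy = plane_weights({1: 800, 2: 800, 3: 800, 4: 800})

# One 400G sub-cable of plane 2's Y-cable fails, halving that plane:
# its share drops to 400/2800 and the others rise proportionally.
degraded = plane_weights({1: 800, 2: 400, 3: 800, 4: 800})

# An 800G port on plane 3 fails entirely: its share drops to zero,
# so traffic toward that GPU over plane 3 ceases immediately.
failed = plane_weights({1: 800, 2: 800, 3: 0, 4: 800})
```
]]></artwork>
      </figure>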

      <t>This document describes how to extend the Weighted Equal-Cost
      Multi-Path (WECMP) load-balancing mechanism, referred to as Fully
      Adaptive Routing Ethernet (FARE) in <xref target="I-D.xu-idr-fare"/>,
      which was originally designed for scale-out networks, to scale-up
      networks.</t>
    </section>

    <section anchor="Abbreviations_Terminology" title="Terminology">
      <t>This memo makes use of the terms defined in <xref
      target="RFC2119"/>.</t>
    </section>

    <section title="Solution Description">
      <t>Each pair of GPUs establishes multiple Remote Direct Memory Access
      (RDMA) Queue Pairs (QPs) for data transmission by using the loopback
      addresses of the GPUs. Note that upper-layer adaptations can enable
      memory semantic operations (load/store/atomic) based on RDMA message
      semantics. However, implementation details are beyond the scope of this
      document.</t>

      <t>By acting as Border Gateway Protocol (BGP) speakers, GPU servers
      exchange BGP routes with connected switches of different planes (e.g.,
      advertising the reachability of their loopback addresses). This allows
      servers to obtain route reachability and available path bandwidth
      information for each destination GPU, enabling WECMP load balancing
      across multiple planes.</t>

      <t>In addition, data-plane health-check mechanisms running directly
      between GPUs across each network plane could be leveraged to speed up
      route convergence, especially when a network plane is broken.</t>

      <section title="Per-Flow Weighted Load Balancing">
        <t>For per-flow weighted load balancing, a minimum of one QP per
        sub-port must be established across each network plane between a given
        pair of GPUs. Each QP utilizes a unique UDP source port to
        differentiate traffic flows. For example, if a physical port is
        divided into m sub-ports and there are n distinct network planes
        (where n &ge; 1), at least m &times; n QPs must be
        instantiated&mdash;one QP per sub-port per plane&mdash;to ensure
        proper flow distribution across all available paths. Consequently, the
        traffic between each pair of GPUs is balanced across all available
        network planes (a.k.a., QPs bound to those network planes) according
        to the path bandwidth values associated with those network planes. In
        addition, the traffic distributed to a given network plane (a.k.a.,
        QPs bound to that network plane) is further evenly distributed at the
        QP granularity across available links connected to that network plane
        by the source GPU server.</t>
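
        <t>The QP instantiation rule above (at least m &times; n QPs, one per
        sub-port per plane, each with a unique UDP source port) can be
        sketched as follows. This is a non-normative illustration with
        hypothetical names; the base UDP source port shown is arbitrary:</t>

        <figure>
          <artwork><![CDATA[
```python
# Illustrative sketch: instantiate one QP per sub-port per plane,
# each with a unique UDP source port to differentiate flows.
def build_qps(m_subports, n_planes, base_udp_src_port=49152):
    qps = []
    udp_src_port = base_udp_src_port
    for plane in range(1, n_planes + 1):
        for subport in range(1, m_subports + 1):
            qps.append({"plane": plane, "subport": subport,
                        "udp_src_port": udp_src_port})
            udp_src_port += 1
    return qps

# m = 2 sub-ports per physical port and n = 4 network planes yields
# at least 2 x 4 = 8 QPs between a given pair of GPUs.
qps = build_qps(m_subports=2, n_planes=4)
```
]]></artwork>
        </figure>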

        <t>GPU servers could utilize a&nbsp;connection tracking table&mdash;a
        technique commonly used in Server Load Balancer (SLB) systems&mdash;to
        implement per-flow weighted load balancing. When the path bandwidth of
        a route via a specific network plane to a destination GPU
        degrades&mdash;due to events such as network plane failures or partial
        link outages&mdash;existing QPs traversing unaffected
        planes retain their established forwarding paths. Meanwhile, the
        source GPU must release all or a subset of QPs associated with the
        affected network plane, adjusting their usage in strict accordance
        with updated weight values that reflect the reduced capacity.
        Conversely, when path bandwidth via a previously degraded network
        plane recovers&mdash;such as after failed links or planes are
        restored&mdash;the source GPU reinstates all or a subset of QPs
        traversing that plane. This reestablishment is performed in alignment
        with the revised weight values, which now reflect the increased
        available bandwidth, ensuring optimal traffic distribution across all
        operational network paths. </t>
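
        <t>The release/reinstate behavior described above can be sketched
        non-normatively as follows: the number of QPs kept active on a plane
        tracks that plane's currently advertised bandwidth, while QPs on
        unaffected planes are left untouched. The names and the per-plane QP
        count are illustrative assumptions:</t>

        <figure>
          <artwork><![CDATA[
```python
# Hypothetical sketch: keep the number of active QPs on each plane
# proportional to that plane's currently advertised path bandwidth.
def target_active_qps(plane_bw_gbps, full_bw_gbps=800, full_qps=4):
    """QPs that should remain active on one plane."""
    return round(full_qps * plane_bw_gbps / full_bw_gbps)

# A healthy 800G plane keeps all 4 QPs; a plane degraded to 400G
# releases half of them; a failed plane releases every QP on it.
healthy_qps = target_active_qps(800)   # plane fully up
degraded_qps = target_active_qps(400)  # one 400G sub-cable failed
failed_qps = target_active_qps(0)      # plane down
```
]]></artwork>
        </figure>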

        <t>The expiration timer for connection tracking entries can be
        configured based on the traffic characteristics of collective
        communications, such as&nbsp;periodic burst patterns. For example,
        entries corresponding to a QP can expire during the interval between
        consecutive bursts. This ensures that each batch of data transferred
        between GPU pairs is distributed according to the&nbsp;current weight
        values&nbsp;of available paths.</t>
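
        <t>The following non-normative sketch illustrates such an expiration
        timer: an entry is refreshed while a burst is in flight and expires
        during the quiet interval before the next burst, so the next batch is
        balanced against the current weight values. All names are
        hypothetical:</t>

        <figure>
          <artwork><![CDATA[
```python
import time

# Hypothetical sketch: a connection-tracking entry that expires in
# the quiet interval between consecutive periodic bursts.
class ConnTrackEntry:
    def __init__(self, qp_id, idle_timeout_s):
        self.qp_id = qp_id
        self.idle_timeout_s = idle_timeout_s
        self.last_seen = time.monotonic()

    def touch(self):
        """Refresh the entry whenever a packet of this flow is seen."""
        self.last_seen = time.monotonic()

    def expired(self, now=None):
        """True once the inter-burst idle interval has elapsed."""
        if now is None:
            now = time.monotonic()
        return (now - self.last_seen) > self.idle_timeout_s
```
]]></artwork>
        </figure>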

        <t>The switch within each network plane should perform per-flow load
        balancing as well to ensure ordered packet delivery for all QPs.</t>
      </section>

      <section title="Per-Packet Weighted Load Balancing">
        <t>For per-packet weighted load balancing, all QPs established between
        a pair of GPUs must support disordered packet delivery (e.g., via the
        Direct Data Placement mechanism described in <xref
        target="RFC7306"/>). Similarly, the traffic between each pair of GPUs
        is balanced across all available network planes according to the path
        bandwidth values associated with those network planes. In this mode, a
        single QP per network plane between a given GPU pair suffices, with
        packets sprayed evenly across all available links connected to that
        network plane by the source GPU server.</t>
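
        <t>As a non-normative illustration, per-packet weighted spraying can
        be sketched as a bandwidth-proportional plane sequence, assuming at
        least one plane is up. The names are hypothetical, and packet
        reordering across planes is tolerated by the QPs as noted above:</t>

        <figure>
          <artwork><![CDATA[
```python
from functools import reduce
from itertools import cycle, islice
from math import gcd

# Illustrative sketch: spray packets across planes one at a time, in
# proportion to each plane's available path bandwidth.
def spray_sequence(path_bw_gbps):
    """Yield plane IDs in proportion to each plane's bandwidth."""
    g = reduce(gcd, path_bw_gbps.values())
    one_cycle = [plane
                 for plane, bw in path_bw_gbps.items()
                 for _ in range(bw // g)]
    return cycle(one_cycle)

# Plane 2 degraded to 400G: it carries one of every seven packets,
# while each healthy 800G plane carries two.
seq = spray_sequence({1: 800, 2: 400, 3: 800, 4: 800})
first_cycle = list(islice(seq, 7))
```
]]></artwork>
        </figure>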

        <t>The switch within each network plane could perform per-packet load
        balancing since disordered packet delivery is acceptable for all
        QPs.</t>
      </section>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>TBD.</t>

    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>

    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='reference.RFC.2119'?>

    </references>

    <references title="Informative References">
      <?rfc include='reference.RFC.7306'?>

      <?rfc include="reference.I-D.xu-idr-fare"?>

    </references>
  </back>
</rfc>
