<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
    There has to be one entity for each item to be referenced. 
    An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml">
<!ENTITY RFC3552 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3552.xml">
<!ENTITY I-D.narten-iana-considerations-rfc2434bis SYSTEM "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.narten-iana-considerations-rfc2434bis.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
    please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
    (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic reference tags, i.e., [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
    (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std" docName="draft-xu-lsr-fare-01" ipr="trust200902">
  <front>
    <title abbrev="FARE">Fully Adaptive Routing Ethernet</title>

    <author fullname="Xiaohu Xu" initials="X." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <email>xuxiaohu_ietf@hotmail.com</email>
      </address>
    </author>

    <author fullname="Zongying He" initials="Z." surname="He">
      <organization>Broadcom</organization>

      <address>
        <email>zongying.he@broadcom.com</email>
      </address>
    </author>

    <author fullname="Junjie Wang" initials="J." surname="Wang">
      <organization>Centec</organization>

      <address>
        <email>wangjj@centec.com</email>
      </address>
    </author>

    <author fullname="Hongyi Huang" initials="H." surname="Huang">
      <organization>Huawei</organization>

      <address>
        <email>hongyi.huang@huawei.com</email>
      </address>
    </author>

    <author fullname="Qingliang Zhang" initials="Q." surname="Zhang">
      <organization>H3C</organization>

      <address>
        <email>zhangqingliang@h3c.com</email>
      </address>
    </author>

    <author fullname="Hang Wu" initials="H." surname="Wu">
      <organization>Ruijie Networks</organization>

      <address>
        <email>wuhang@ruijie.com.cn</email>
      </address>
    </author>

    <author fullname="Yadong Liu" initials="Y." surname="Liu">
      <organization>Tencent</organization>

      <address>
        <email>zeepliu@tencent.com</email>
      </address>
    </author>

    <author fullname="Yinben Xia" initials="Y." surname="Xia">
      <organization>Tencent</organization>

      <address>
        <email>forestxia@tencent.com</email>
      </address>
    </author>


    <date day="29" month="January" year="2024"/>

    <abstract>
      <t>Large language models (LLMs) such as ChatGPT have become
      increasingly popular in recent years due to their impressive
      performance on various natural language processing tasks. These models
      are built by training deep neural networks on massive amounts of text
      data and often consist of billions or even trillions of parameters.
      However, the training process for these models can be extremely
      resource-intensive, requiring the deployment of thousands or even tens
      of thousands of GPUs in a single AI training cluster. Therefore,
      three-stage or even five-stage CLOS networks are commonly adopted for
      AI networks, and the non-blocking nature of the network becomes
      increasingly critical for large-scale AI models. Adaptive routing is
      therefore necessary to dynamically load-balance traffic to the same
      destination over multiple ECMP paths, based on link capacity and even
      congestion information along those paths.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>Large language models (LLMs) such as ChatGPT have become
      increasingly popular in recent years due to their impressive
      performance on various natural language processing tasks. These models
      are built by training deep neural networks on massive amounts of text
      data and often consist of billions or even trillions of parameters.
      However, the training process for these models can be extremely
      resource-intensive, requiring the deployment of thousands or even tens
      of thousands of GPUs in a single AI training cluster. Therefore,
      three-stage or even five-stage CLOS networks are commonly adopted for
      AI networks. Furthermore, in rail-optimized CLOS topologies with
      standard GPU servers (an HB domain of eight GPUs), the Nth GPU of each
      server in a group of servers is connected to the Nth leaf switch,
      which provides higher bandwidth and non-blocking connectivity between
      the GPUs on the same rail. In a rail-optimized topology, most traffic
      between GPU servers traverses the intra-rail networks rather than the
      inter-rail networks.</t>

      <t>The non-blocking nature of the network, especially the network for
      intra-rail communication, becomes increasingly critical for
      large-scale AI models. AI workloads tend to be extremely
      bandwidth-hungry and usually generate a few elephant flows
      simultaneously. If traditional hash-based ECMP load-balancing is used
      without any optimization, serious congestion and high latency are
      highly likely once multiple elephant flows are hashed onto the same
      link. Since the job completion time depends on worst-case performance,
      serious congestion will make model training take longer than expected.
      Therefore, adaptive routing is necessary to dynamically load-balance
      traffic to the same destination over multiple ECMP paths, based on
      network capacity and even congestion information along those paths. In
      other words, adaptive routing is a capacity-aware and even
      congestion-aware path selection algorithm.</t>

      <t>Furthermore, to reduce the congestion risk to the maximum extent,
      routing should be as granular as possible. Flow-granular adaptive
      routing still carries a certain statistical possibility of congestion.
      Packet-granular adaptive routing is therefore more desirable, although
      packet spraying causes an out-of-order delivery issue, so a flexible
      reordering mechanism must be put in place (e.g., at egress ToRs or the
      receiving servers). Recent optimizations for RoCE, as well as newly
      invented transport protocols positioned as alternatives to RoCE, no
      longer require handling out-of-order delivery at the network layer;
      instead, the message processing layer addresses it.</t>

      <t>To enable adaptive routing, whether flow-granular or
      packet-granular, it is necessary to propagate network topology
      information, including link capacity and even available link capacity
      (i.e., link capacity minus link load), across the CLOS network.
      Therefore, it seems straightforward to use a link-state protocol such
      as OSPF or ISIS, instead of BGP, as the underlay routing protocol in
      the CLOS network, propagating link capacity and available-capacity
      information via the OSPF or ISIS TE Metric or Extended TE Metric <xref
      target="RFC3630"/> <xref target="RFC7471"/> <xref target="RFC5305"/>
      <xref target="RFC7810"/>.</t>
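
      <t>As an illustrative sketch only, the following Python fragment
      encodes a bandwidth value the way the OSPF TE Maximum Bandwidth
      sub-TLV of <xref target="RFC3630"/> does: a 4-octet IEEE
      single-precision float expressed in bytes per second. The function
      names and the 400G example value are hypothetical, not taken from any
      specification:</t>

      <t><figure>
          <artwork><![CDATA[
```python
import struct

def encode_max_bandwidth(bps: float) -> bytes:
    # RFC 3630 Maximum Bandwidth sub-TLV: 4 octets, IEEE single-precision
    # float, expressed in bytes per second (hence the division by 8).
    return struct.pack(">f", bps / 8.0)

def decode_max_bandwidth(data: bytes) -> float:
    # Decode the 4-octet value back to bits per second.
    (bytes_per_sec,) = struct.unpack(">f", data)
    return bytes_per_sec * 8.0

# Illustrative 400 Gbit/s link.
wire = encode_max_bandwidth(400e9)
```
]]></artwork>
        </figure></t>

      <t>The available-bandwidth sub-TLVs of <xref target="RFC7471"/> and
      <xref target="RFC7810"/> follow the same IEEE-float-in-bytes-per-second
      convention.</t>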

      <t>For approaches to resolving the flooding issues caused by
      link-state protocols in large CLOS networks, please refer to <xref
      target="I-D.xu-lsr-flooding-reduction-in-clos"/>.</t>

      <t>Note that while adaptive routing, especially at packet granularity,
      can help reduce congestion between switches in the network, thereby
      achieving a non-blocking fabric, it does not address the incast
      congestion commonly experienced at last-hop switches connected to the
      receivers in many-to-one communication patterns. Therefore, a
      congestion control mechanism is always necessary between the sending
      and receiving servers to mitigate such congestion.</t>
    </section>

    <section anchor="Abbreviations_Terminology" title="Terminology">
      <t>This memo makes use of the terms defined in <xref target="RFC2328"/>
      and <xref target="RFC1195"/>.</t>
    </section>

    <section title="Solution Description">
      <t/>

      <section title="Adaptive Routing in 3-stage CLOS">
        <t><figure>
            <artwork align="center"><![CDATA[      
   +----+ +----+ +----+ +----+  
   | S1 | | S2 | | S3 | | S4 |  (Spine)
   +----+ +----+ +----+ +----+             
             
   +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
   | L1 | | L2 | | L3 | | L4 | | L5 | | L6 | | L7 | | L8 |  (Leaf)
   +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ 


                              Figure 1]]></artwork>
          </figure></t>

        <t>(Note that the diagram above does not include the connections
        between nodes. However, it can be assumed that leaf nodes are
        connected to every spine node in their CLOS topology.)</t>

        <t>In a three-stage CLOS network as shown in Figure 1, also known as a
        leaf-spine network, all nodes SHOULD be in OSPF area zero or ISIS
        Level-2.</t>

        <t>Leaf nodes are enabled for adaptive routing in OSPF area zero or
        ISIS Level-2.</t>

        <t>When a leaf node, such as L1, calculates the shortest path to a
        specific IP prefix originated by another leaf node in the same OSPF
        area or ISIS Level-2 area, say L2, four equal-cost multi-path (ECMP)
        routes will be created via the four spine nodes S1, S2, S3, and S4.
        To enable adaptive routing, weight values based on the link
        capacity, or even the available link capacity, of the upstream and
        downstream links SHOULD be considered for global load-balancing. In
        particular, the minimum of the capacity of the upstream link (e.g.,
        L1-&gt;S1) and the capacity of the downstream link (e.g., S1-&gt;L2)
        of a given path (e.g., L1-&gt;S1-&gt;L2) is used as the weight value
        for that path when performing weighted ECMP load-balancing.</t>
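
        <t>As an illustrative sketch only (not a normative procedure), the
        bottleneck-weight computation and weighted ECMP selection described
        above could look as follows; the capacity figures are hypothetical
        and the node names follow Figure 1:</t>

        <t><figure>
            <artwork><![CDATA[
```python
import random

# Hypothetical capacities (Gbit/s) for the four paths L1->Sx->L2.
upstream = {"S1": 400, "S2": 400, "S3": 200, "S4": 400}    # L1 -> spine
downstream = {"S1": 400, "S2": 100, "S3": 400, "S4": 400}  # spine -> L2

# The weight of each path is the bottleneck of its two hops.
weights = {s: min(upstream[s], downstream[s]) for s in upstream}

def pick_next_hop(rng):
    # Weighted ECMP: choose a spine in proportion to its path weight.
    spines = sorted(weights)
    return rng.choices(spines, weights=[weights[s] for s in spines], k=1)[0]
```
]]></artwork>
          </figure></t>

        <t>With these numbers, S2 carries a degraded 100G downstream hop and
        S3 a degraded 200G upstream hop, so traffic toward L2 shifts
        proportionally onto S1 and S4.</t>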
      </section>

      <section title="Adaptive Routing in 5-stage CLOS">
        <t><figure>
            <artwork align="center"><![CDATA[      
   =========================================         
   # +----+ +----+ +----+ +----+           #
   # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
   # +----+ +----+ +----+ +----+           #
   #                                PoD-1  #
   # +----+ +----+ +----+ +----+           #
   # | S1 | | S2 | | S3 | | S4 | (Spine)   #
   # +----+ +----+ +----+ +----+           #
   =========================================

   ===============================     ===============================
   # +----+ +----+ +----+ +----+ #     # +----+ +----+ +----+ +----+ #
   # |SS1 | |SS2 | |SS3 | |SS4 | #     # |SS1 | |SS2 | |SS3 | |SS4 | #
   # +----+ +----+ +----+ +----+ #     # +----+ +----+ +----+ +----+ #
   #   (Super-Spine@Plane-1)     #     #   (Super-Spine@Plane-4)     #
   =============================== ... ===============================

   =========================================         
   # +----+ +----+ +----+ +----+           #
   # | S1 | | S2 | | S3 | | S4 | (Spine)   #
   # +----+ +----+ +----+ +----+           #
   #                                PoD-8  #
   # +----+ +----+ +----+ +----+           #
   # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
   # +----+ +----+ +----+ +----+           #
   =========================================           

                              Figure 2]]></artwork>
          </figure>(Note that the diagram above does not include the
        connections between nodes. However, it can be assumed that the leaf
        nodes in a given PoD are connected to every spine node in that PoD.
        Similarly, each spine node (e.g., S1) is connected to all super-spine
        nodes in the corresponding PoD-interconnect plane (e.g.,
        Plane-1).)</t>

        <t>For a five-stage CLOS network as illustrated in Figure 2, each
        PoD, consisting of leaf and spine nodes, is configured as an OSPF
        non-zero area or an ISIS Level-1 area. Each PoD-interconnect plane,
        consisting of spine and super-spine nodes, is configured as OSPF
        area zero or an ISIS Level-2 area. Spine nodes therefore play the
        role of OSPF area border routers or ISIS Level-1-2 routers.</t>

        <t>In a rail-optimized topology, intra-rail communication with high
        bandwidth requirements is restricted to a single PoD, while
        inter-rail communication with lower bandwidth requirements can cross
        PoDs through the PoD-interconnect planes. Therefore, enabling
        adaptive routing only within the PoD networks is sufficient. In
        particular, only leaf nodes are enabled for adaptive routing in
        their associated OSPF non-zero area or ISIS Level-1 area.</t>

        <t>When a leaf node within a given PoD (i.e., in a given OSPF
        non-zero area or ISIS Level-1 area), such as L1 in PoD-1, calculates
        the shortest path to a specific IP prefix originated by another leaf
        node in the same PoD, say L2 in PoD-1, four equal-cost multi-path
        (ECMP) routes will be created via the four spine nodes S1, S2, S3,
        and S4 in the same PoD. To enable adaptive routing, weight values
        based on the link capacity, or even the available link capacity, of
        the upstream and downstream links SHOULD be considered for global
        load-balancing. In particular, the minimum of the capacity of the
        upstream link (e.g., L1-&gt;S1) and the capacity of the downstream
        link (e.g., S1-&gt;L2) of a given path (e.g., L1-&gt;S1-&gt;L2) is
        used as the weight value of that path.</t>
      </section>
    </section>

    <section title="Modifications to OSPF and ISIS Behavior">
      <t>Once an OSPF or ISIS router is enabled for adaptive routing, the
      capacity, or even the available capacity, of each SPF path SHOULD be
      calculated and used as a weight value for global load-balancing
      purposes.</t>

      <t>When advertising the available-link-capacity metric alongside the
      link-capacity metric, it is important to keep adaptive routing
      sufficiently stable. To achieve this, a threshold SHOULD be set on
      available-link-capacity fluctuations to avoid overly frequent LSA or
      LSP advertisements. That is to say, any update that would otherwise be
      triggered by a minor available-link-capacity fluctuation below that
      threshold is suppressed.</t>
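
      <t>A minimal sketch of such threshold-based damping, with hypothetical
      class and parameter names, might look as follows; here the threshold
      is a fraction of the link capacity, and an update is flooded only when
      the available capacity has moved by more than that amount since the
      last advertisement:</t>

      <t><figure>
          <artwork><![CDATA[
```python
class AvailableCapacityAdvertiser:
    """Suppress LSA/LSP updates for minor fluctuations: re-advertise the
    available-capacity metric only when it has moved by more than
    `threshold` (a fraction of link capacity) since the last
    advertisement."""

    def __init__(self, capacity_gbps, threshold=0.1):
        self.capacity = capacity_gbps
        self.threshold = threshold
        self.last_advertised = None

    def update(self, available_gbps):
        # Return True when an advertisement should be flooded.
        if self.last_advertised is None:
            self.last_advertised = available_gbps
            return True
        if abs(available_gbps - self.last_advertised) > self.threshold * self.capacity:
            self.last_advertised = available_gbps
            return True
        return False
```
]]></artwork>
        </figure></t>

      <t>In a real implementation, such damping would typically also be
      bounded by a periodic refresh timer so that a stale value is
      eventually re-advertised.</t>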
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>TBD.</t>

    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>

    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='reference.RFC.2119'?>

      <?rfc include='reference.RFC.2328'?>

      <?rfc include='reference.RFC.5340'?>

      <?rfc include='reference.RFC.1195'?>

    </references>

    <references title="Informative References">
      <?rfc include='reference.RFC.5305'?>

      <?rfc include='reference.RFC.7810'?>

      <?rfc include='reference.RFC.7471'?>

      <?rfc include='reference.RFC.3630'?>

      <?rfc include="reference.I-D.xu-lsr-flooding-reduction-in-clos"?>

    </references>
  </back>
</rfc>
