<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
    There has to be one entity for each item to be referenced. 
    An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml">
<!ENTITY RFC3552 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3552.xml">
<!ENTITY I-D.narten-iana-considerations-rfc2434bis SYSTEM "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.narten-iana-considerations-rfc2434bis.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
    please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
    (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
    (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std" docName="draft-xu-lsr-fare-04" ipr="trust200902">
  <front>
    <title abbrev="FARE using LSR">Fully Adaptive Routing Ethernet using
    LSR</title>

    <author fullname="Xiaohu Xu" initials="X." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <email>xuxiaohu_ietf@hotmail.com</email>
      </address>
    </author>

    <author fullname="Shraddha Hegde" initials="S." surname="Hegde">
      <organization>Juniper</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <region/>

          <code/>

          <country/>
        </postal>

        <phone/>

        <facsimile/>

        <email>shraddha@juniper.net</email>

        <uri/>
      </address>
    </author>

    <author fullname="Zongying He" initials="Z." surname="He">
      <organization>Broadcom</organization>

      <address>
        <email>zongying.he@broadcom.com</email>
      </address>
    </author>

    <author fullname="Junjie Wang" initials="J." surname="Wang">
      <organization>Centec</organization>

      <address>
        <email>wangjj@centec.com</email>
      </address>
    </author>

    <author fullname="Hongyi Huang" initials="H." surname="Huang">
      <organization>Huawei</organization>

      <address>
        <email>hongyi.huang@huawei.com</email>
      </address>
    </author>

    <author fullname="Qingliang Zhang" initials="Q." surname="Zhang">
      <organization>H3C</organization>

      <address>
        <email>zhangqingliang@h3c.com</email>
      </address>
    </author>

    <author fullname="Hang Wu" initials="H." surname="Wu">
      <organization>Ruijie Networks</organization>

      <address>
        <email>wuhang@ruijie.com.cn</email>
      </address>
    </author>

    <author fullname="Yadong Liu" initials="Y." surname="Liu">
      <organization>Tencent</organization>

      <address>
        <email>zeepliu@tencent.com</email>
      </address>
    </author>

    <author fullname="Yinben Xia" initials="Y." surname="Xia">
      <organization>Tencent</organization>

      <address>
        <email>forestxia@tencent.com</email>
      </address>
    </author>

    <author fullname="Peilong Wang" initials="P." surname="Wang">
      <organization>Baidu</organization>

      <address>
        <email>wangpeilong01@baidu.com</email>
      </address>
    </author>

    <!--

-->

    <date day="18" month="May" year="2025"/>

    <abstract>
      <t>Large language models (LLMs) like ChatGPT have become increasingly
      popular in recent years due to their impressive performance in various
      natural language processing tasks. These models are built by training
      deep neural networks on massive amounts of text data, as well as visual
      and video data, often consisting of billions or even trillions of
      parameters. However, the training process for these models can be
      extremely resource-intensive, requiring the deployment of thousands or
      even tens of thousands of GPUs in a single AI training cluster.
      Therefore, three-stage or even five-stage CLOS networks are commonly
      adopted for AI networks. The non-blocking nature of the network becomes
      increasingly critical for large-scale AI models. Therefore, adaptive
      routing is necessary to dynamically distribute traffic to the same
      destination over multiple equal-cost paths, based on network capacity
      and even congestion information along those paths.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>Large language models (LLMs) like ChatGPT have become increasingly
      popular in recent years due to their impressive performance in various
      natural language processing tasks. These models are built by training
      deep neural networks on massive amounts of text data, as well as visual
      and video data, often consisting of billions or even trillions of
      parameters. However, the training process for these models can be
      extremely resource-intensive, requiring the deployment of thousands or
      even tens of thousands of GPUs in a single AI training cluster.
      Therefore, three-stage or even five-stage CLOS networks are commonly
      adopted for AI networks. Furthermore, in rail-optimized CLOS network
      topologies with standard GPU servers (HB domain of eight GPUs), the Nth
      GPU of each server in a group of servers is connected to the Nth leaf
      switch, which provides higher bandwidth and non-blocking connectivity
      between the GPUs in the same rail. In a rail-optimized network
      topology, most traffic between GPU servers traverses the intra-rail
      networks rather than the inter-rail networks. In addition, whether in
      rail-optimized or rail-free networks, collective communication job
      schedulers always opt to schedule jobs with network topology awareness
      to minimize the amount of traffic going to the upper layers of the
      network.</t>

      <t>The non-blocking nature of the network, particularly at the lower
      layers, is essential for large-scale AI training clusters. AI workloads
      are usually very bandwidth-hungry and often generate several large data
      flows simultaneously. If traditional hash-based ECMP load balancing is
      used without optimization, it can lead to serious congestion and high
      latency in the network when multiple large data flows are directed to
      the same link. This congestion can result in longer-than-expected model
      training times, as job completion time depends on worst-case
      performance. Therefore, adaptive routing is necessary to dynamically
      distribute traffic to the same destination across multiple equal-cost
      paths, taking into account network capacity and even congestion along
      these paths. In essence, adaptive routing is a capacity- and even
      congestion-aware dynamic path selection algorithm.</t>

      <t>Furthermore, to minimize the risk of congestion, routing should be
      as granular as possible. Flow-granular adaptive routing still leaves a
      statistical possibility of congestion. Packet-granular adaptive routing
      is therefore more desirable, although packet spraying causes
      out-of-order delivery, so a flexible reordering mechanism must be put
      in place (e.g., at egress ToRs or the receiving servers). Recent
      optimizations for RoCE and newly invented transport protocols as
      alternatives to RoCE no longer require handling out-of-order delivery
      at the network layer. Instead, the message processing layer is used to
      address it.</t>

      <t>To enable adaptive routing, whether flow-granular or
      packet-granular, it is necessary to propagate network topology
      information, including link capacity, across the CLOS network.
      Therefore, it seems straightforward to use link-state protocols such as
      OSPF or ISIS as the underlay routing protocol in the CLOS network,
      instead of BGP.</t>

      <t>Hence, this document defines a new prefix attribute sub-TLV, referred
      to as the Path Bandwidth sub-TLV, and describes how to use this sub-TLV
      together with the Maximum Bandwidth sub-TLV of the Link TLV as defined
      in the OSPF or ISIS TE extensions <xref target="RFC3630"/>, <xref
      target="RFC5329"/>, <xref target="RFC5305"/> to calculate end-to-end
      path bandwidth within the data center fabric so as to achieve adaptive
      routing.</t>

      <t>For information on how to resolve the flooding issue caused by the
      use of link-state protocols in large-scale CLOS networks, please refer
      to the following document <xref
      target="I-D.xu-lsr-flooding-reduction-in-clos"/>.</t>

      <t>Note that while adaptive routing, especially at the packet-granular
      level, can help reduce congestion between switches in the network,
      thereby achieving a non-blocking fabric, it does not address the incast
      congestion issue which is commonly experienced in last-hop switches that
      are connected to the receivers in many-to-one communication patterns.
      Therefore, a congestion control mechanism is always necessary between
      the sending and receiving servers to mitigate such congestion.</t>
    </section>

    <section anchor="Abbreviations_Terminology" title="Terminology">
      <t>This memo makes use of the terms defined in <xref target="RFC1195"/>,
      <xref target="RFC2328"/>, and <xref target="RFC5340"/>.</t>
    </section>

    <section title="Path Bandwidth Sub-TLV">
      <t>When IP reachability information is advertised across ISIS levels or
      OSPF areas, the advertisement needs to carry the path bandwidth
      associated with the advertised IP prefix, which indicates the minimum
      bandwidth of all links along the path towards that prefix.</t>

      <t>For ISIS, an optional sub-TLV referred to as the Path Bandwidth
      sub-TLV is defined. The type of this sub-TLV is TBD, and its length is
      four octets. The value is filled with the path bandwidth associated
      with a given prefix in IEEE floating-point format. The units are bytes
      per second. This sub-TLV MAY be conveyed in TLVs 135, 235, 236, or 237,
      just like the prefix attribute-related sub-TLVs defined in <xref
      target="RFC7794"/>.</t>

      <t>For OSPFv2, since the OSPFv2 Extended Prefix TLV <xref
      target="RFC7684"/> is used to advertise additional attributes associated
      with the prefix, an optional sub-TLV of the OSPFv2 Extended Prefix TLV
      referred to as the Path Bandwidth sub-TLV is defined. The type of this
      sub-TLV is TBD, and its length is four octets. The value is filled with
      the path bandwidth associated with a given prefix in IEEE
      floating-point format. The units are bytes per second.</t>

      <t>For OSPFv3, an optional sub-TLV of the Intra-Area-Prefix TLV,
      Inter-Area-Prefix TLV, and External-Prefix TLV <xref target="RFC8362"/>
      referred to as the Path Bandwidth sub-TLV is defined. The type of this
      sub-TLV is TBD, and its length is four octets. The value is filled with
      the path bandwidth associated with a given prefix in IEEE
      floating-point format. The units are bytes per second.</t>
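
      <t>As an illustration only, the following Python sketch encodes and
      decodes the value field described above for the ISIS case. It assumes
      the single-precision IEEE format that the four-octet length suggests,
      network byte order, and the usual one-octet type and one-octet length
      of an ISIS sub-TLV. The type code 0xFF and the function names are
      placeholders invented for this example, since the actual code point is
      TBD. An OSPF encoding would differ in its type and length field
      sizes.</t>

      <t><figure>
          <artwork><![CDATA[
```python
import struct

# Placeholder only: the draft leaves the sub-TLV type as TBD.
PATH_BW_SUBTLV_TYPE = 0xFF  # hypothetical, NOT an assigned value


def encode_isis_path_bw(bytes_per_sec: float) -> bytes:
    """Build type (1 octet), length (1 octet), value (4 octets)."""
    # "!f" = network byte order, IEEE 754 single precision.
    value = struct.pack("!f", bytes_per_sec)
    header = struct.pack("!BB", PATH_BW_SUBTLV_TYPE, len(value))
    return header + value


def decode_isis_path_bw(data: bytes) -> float:
    """Return the bandwidth in bytes per second from the sub-TLV."""
    subtlv_type, length = struct.unpack("!BB", data[:2])
    if subtlv_type != PATH_BW_SUBTLV_TYPE or length != 4:
        raise ValueError("not a Path Bandwidth sub-TLV")
    return struct.unpack("!f", data[2:6])[0]


# 400 Gbit/s expressed in bytes per second.
encoded = encode_isis_path_bw(400e9 / 8)
decoded = decode_isis_path_bw(encoded)
```
]]></artwork>
        </figure></t>

      <t>Because single precision carries roughly seven significant decimal
      digits, a decoded value may differ slightly from the configured one.
      Implementations comparing bandwidths would likely want to compare the
      encoded values rather than the configured ones.</t>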
    </section>

    <section title="Solution Description">

      <section title="Adaptive Routing in 3-stage CLOS">
        <t><figure>
            <artwork align="center"><![CDATA[      
   +----+ +----+ +----+ +----+  
   | S1 | | S2 | | S3 | | S4 |  (Spine)
   +----+ +----+ +----+ +----+             
             
   +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
   | L1 | | L2 | | L3 | | L4 | | L5 | | L6 | | L7 | | L8 |  (Leaf)
   +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ 


                              Figure 1]]></artwork>
          </figure></t>

        <t>(Note that the diagram above does not include the connections
        between nodes. However, it can be assumed that leaf nodes are
        connected to every spine node.)</t>

        <t>In a three-stage CLOS network as shown in Figure 1, also known as a
        leaf-spine network, all nodes MAY be in OSPF area zero or ISIS
        Level-2.</t>

        <t>Leaf nodes and spine nodes are enabled for adaptive routing. As
        such, those nodes will advertise the link capacity by using the
        Maximum Bandwidth sub-TLV. In addition, leaf nodes will advertise the
        path bandwidth associated with each prefix originating from them by
        using the Path Bandwidth sub-TLV. The value of the Path Bandwidth
        sub-TLV is filled with a maximum bandwidth value by default.</t>

        <t>When a leaf node, such as L1, calculates the shortest path to a
        particular IP prefix originated by another leaf node, say L2, in the
        same OSPF area or ISIS Level-2 area, four equal-cost paths via the
        four spine nodes (e.g., S1, S2, S3, and S4) will be calculated. To
        achieve adaptive routing, the capacity associated with each path
        SHOULD be considered as the weight value of that path when performing
        weighted ECMP load-balancing. In particular, for a given path (e.g.,
        L1-&gt;S1-&gt;L2), the minimum among the capacity of the upstream
        link (e.g., L1-&gt;S1), the capacity of the downstream link (e.g.,
        S1-&gt;L2), and the path bandwidth associated with that prefix would
        be used as the weight value for that end-to-end path when
        performing weighted ECMP load-balancing.</t>
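
        <t>The weight rule above can be sketched in Python as follows. This
        is a minimal illustration, not a normative algorithm. The link
        capacities are invented for the example (one degraded uplink and one
        degraded downlink), and the names path_weight, uplinks, and
        downlinks are hypothetical.</t>

        <t><figure>
            <artwork><![CDATA[
```python
G = 125_000_000  # bytes per second in 1 Gbit/s


def path_weight(upstream_bw, downstream_bw, prefix_path_bw):
    """Weight of one equal-cost path: its tightest constraint."""
    return min(upstream_bw, downstream_bw, prefix_path_bw)


# Hypothetical Maximum Bandwidth values from the link-state
# database, as seen by L1 for a prefix originated by L2.
uplinks = {  # L1->Sx
    "S1": 400 * G, "S2": 400 * G, "S3": 400 * G, "S4": 100 * G,
}
downlinks = {  # Sx->L2
    "S1": 400 * G, "S2": 400 * G, "S3": 200 * G, "S4": 400 * G,
}
prefix_path_bw = 400 * G  # Path Bandwidth sub-TLV sent by L2

weights = {
    spine: path_weight(uplinks[spine], downlinks[spine],
                       prefix_path_bw)
    for spine in uplinks
}
```
]]></artwork>
          </figure></t>

        <t>With these assumed capacities, the paths via S1 and S2 keep the
        full 400 Gbit/s weight, while the paths via S3 and S4 are weighted
        at 200 Gbit/s and 100 Gbit/s respectively.</t>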
      </section>

      <section title="Adaptive Routing in 5-stage CLOS">
        <t><figure>
            <artwork align="center"><![CDATA[      
   =========================================         
   # +----+ +----+ +----+ +----+           #
   # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
   # +----+ +----+ +----+ +----+           #
   #                                PoD-1  #
   # +----+ +----+ +----+ +----+           #
   # | S1 | | S2 | | S3 | | S4 | (Spine)   #
   # +----+ +----+ +----+ +----+           #
   =========================================

   ===============================     ===============================
   # +----+ +----+ +----+ +----+ #     # +----+ +----+ +----+ +----+ #
   # |SS1 | |SS2 | |SS3 | |SS4 | #     # |SS1 | |SS2 | |SS3 | |SS4 | #
   # +----+ +----+ +----+ +----+ #     # +----+ +----+ +----+ +----+ #
   #   (Super-Spine@Plane-1)     #     #   (Super-Spine@Plane-4)     #
   #============================== ... ===============================

   =========================================         
   # +----+ +----+ +----+ +----+           #
   # | S1 | | S2 | | S3 | | S4 | (Spine)   #
   # +----+ +----+ +----+ +----+           #
   #                                PoD-8  #
   # +----+ +----+ +----+ +----+           #
   # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
   # +----+ +----+ +----+ +----+           #
   =========================================           

                              Figure 2]]></artwork>
          </figure>(Note that the diagram above does not include the
        connections between nodes. However, it can be assumed that the leaf
        nodes in a given PoD are connected to every spine node in that PoD.
        Similarly, each spine node (e.g., S1) is connected to all super-spine
        nodes in the corresponding PoD-interconnect plane (e.g.,
        Plane-1).)</t>

        <t>For a five-stage CLOS network as illustrated in Figure 2, each PoD
        consisting of leaf and spine nodes is configured as an OSPF non-zero
        area or an ISIS Level-1 area. The PoD-interconnect plane consisting of
        spine nodes and super-spine nodes is configured as an OSPF area zero
        or an ISIS Level-2 area. Therefore, spine nodes play the role of OSPF
        area border routers or ISIS Level-1-2 routers.</t>

        <t>All nodes are enabled for adaptive routing. As such, those nodes
        will advertise the link capacity by using the Maximum Bandwidth
        sub-TLV. In addition, leaf nodes will advertise the path bandwidth
        associated with each prefix originating from them by using the Path
        Bandwidth sub-TLV. The value of the Path Bandwidth sub-TLV SHOULD be
        filled with a maximum bandwidth value by default.</t>

        <t>When leaking IP prefix reachability from an OSPF non-zero area
        to area zero or from ISIS level-1 to level-2 (e.g., an IP prefix
        attached to a leaf node, such as L1@PoD-1), the path bandwidth value
        associated with the prefix would be readvertised or updated by OSPF
        area border routers or ISIS level-1-2 routers (e.g., S1@PoD-1). The
        value is set to the smaller of the bandwidth of the link towards the
        originating router (e.g., L1@PoD-1) and the original path bandwidth
        value associated with the prefix.</t>

        <t>When leaking the above IP prefix reachability from OSPF area
        zero to a non-zero area or from ISIS level-2 to level-1, the path
        bandwidth value associated with the prefix would be readvertised or
        updated by OSPF area border routers or ISIS level-1-2 routers (e.g.,
        S1@PoD-8). The value is set to the smaller of the original path
        bandwidth value associated with the prefix and the total bandwidth
        of all paths towards the advertising router of that prefix
        (e.g., S1@PoD-1).</t>
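
        <t>The two readvertisement rules above can be sketched in Python as
        follows. This is a hedged illustration under stated assumptions: the
        function names leak_up and leak_down are hypothetical, the link
        speeds are invented, and the "total bandwidth of all paths" is
        interpreted here as the sum of the per-path bottleneck bandwidths
        through the super-spine nodes of the plane.</t>

        <t><figure>
            <artwork><![CDATA[
```python
G = 125_000_000  # bytes per second in 1 Gbit/s


def leak_up(original_path_bw, link_bw_to_originator):
    """PoD to interconnect plane (e.g., at S1@PoD-1)."""
    return min(original_path_bw, link_bw_to_originator)


def leak_down(original_path_bw, path_bws_to_advertiser):
    """Interconnect plane to PoD (e.g., at S1@PoD-8).

    Assumption: the total is the sum of the per-path bandwidths.
    """
    return min(original_path_bw, sum(path_bws_to_advertiser))


# L1@PoD-1 originates the prefix with the default value.
advertised_by_l1 = 400 * G
at_s1_pod1 = leak_up(advertised_by_l1,
                     link_bw_to_originator=400 * G)

# S1@PoD-8 reaches S1@PoD-1 via SS1..SS4 in Plane-1. Each path
# is limited by the slower of its two links; the link pair via
# SS3 is assumed to be degraded on one side.
paths_via_plane1 = [
    min(100 * G, 100 * G),  # via SS1
    min(100 * G, 100 * G),  # via SS2
    min(100 * G, 50 * G),   # via SS3
    min(100 * G, 100 * G),  # via SS4
]
at_s1_pod8 = leak_down(at_s1_pod1, paths_via_plane1)
```
]]></artwork>
          </figure></t>

        <t>In this example the value advertised into PoD-8 drops from 400
        Gbit/s to 350 Gbit/s, reflecting the degraded path via SS3.</t>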

        <t>When a leaf node within PoD-8 calculates the shortest path to the
        above IP prefix, four equal-cost paths will be created via four spine
        nodes: S1, S2, S3, and S4 in PoD-8. To enable adaptive routing, the
        capacity of each path SHOULD be considered as a weight value for
        weighted ECMP load-balancing. In particular, the smaller of the
        capacity of the upstream link (e.g., L1@PoD-8-&gt;S1@PoD-8) of
        each path (e.g., L1@PoD-8-&gt;S1@PoD-8) and the path bandwidth
        associated with that prefix is used as the weight value of that path
        when performing weighted ECMP load-balancing.</t>
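
        <t>As a rough sketch of how a leaf in PoD-8 might turn these weights
        into a forwarding split, the Python example below computes the
        per-spine weight and then reduces the weights to small integer
        ratios. The reduction step is an assumption of this example, not
        something this document specifies; it mirrors the common practice of
        programming weighted ECMP groups with integer member weights. All
        bandwidth values and the name ecmp_ratios are hypothetical.</t>

        <t><figure>
            <artwork><![CDATA[
```python
from functools import reduce
from math import gcd

G = 125_000_000  # bytes per second in 1 Gbit/s


def ecmp_ratios(weights):
    """Reduce integer weights to their smallest integer ratios."""
    divisor = reduce(gcd, weights.values())
    if divisor == 0:  # every path has zero bandwidth
        return {nh: 0 for nh in weights}
    return {nh: w // divisor for nh, w in weights.items()}


# Hypothetical view of L1@PoD-8: upstream link capacity and the
# Path Bandwidth value each spine advertises for the prefix.
uplink_bw = {
    "S1": 400 * G, "S2": 400 * G, "S3": 400 * G, "S4": 400 * G,
}
path_bw_via = {
    "S1": 350 * G, "S2": 400 * G, "S3": 400 * G, "S4": 200 * G,
}

weights = {
    spine: min(uplink_bw[spine], path_bw_via[spine])
    for spine in uplink_bw
}
ratios = ecmp_ratios(weights)
```
]]></artwork>
          </figure></t>

        <t>With these assumed inputs, traffic towards the prefix would be
        split across S1, S2, S3, and S4 in the ratio 7:8:8:4.</t>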
      </section>
    </section>

    <section title="Modifications to SPF Computation Behavior">
      <t>Once an OSPF or ISIS router is enabled for adaptive routing, the
      capacity of each SPF path SHOULD be calculated as a weight value of that
      path for weighted ECMP load-balancing purposes.</t>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>TBD.</t>

      <!---->
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>

      <!---->
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='reference.RFC.1195'?>

      <?rfc include='reference.RFC.2119'?>

      <?rfc include='reference.RFC.2328'?>

      <?rfc include='reference.RFC.3630'?>

      <?rfc include='reference.RFC.5305'?>

      <?rfc include='reference.RFC.5329'?>

      <?rfc include='reference.RFC.5340'?>

      <?rfc include='reference.RFC.7684'?>

      <?rfc include='reference.RFC.8362'?>

      <!---->
    </references>

    <references title="Informative References">
      <?rfc include='reference.RFC.7794'?>

      <?rfc include="reference.I-D.xu-lsr-flooding-reduction-in-clos"?>

      <!---->
    </references>
  </back>
</rfc>
