<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
    There has to be one entity for each item to be referenced. 
    An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml">
<!ENTITY RFC3552 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3552.xml">
<!ENTITY I-D.narten-iana-considerations-rfc2434bis SYSTEM "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.narten-iana-considerations-rfc2434bis.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
    please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
    (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
    (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std" docName="draft-xu-idr-fare-00" ipr="trust200902">
  <front>
    <title abbrev="FARE using BGP">Fully Adaptive Routing Ethernet using
    BGP</title>

    <author fullname="Xiaohu Xu" initials="X." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <email>xuxiaohu_ietf@hotmail.com</email>
      </address>
    </author>

    <date day="1" month="July" year="2024"/>

    <abstract>
      <t>Large language models (LLMs) like ChatGPT have become increasingly
      popular in recent years due to their impressive performance in various
      natural language processing tasks. These models, which often consist of
      billions or even trillions of parameters, are built by training deep
      neural networks on massive amounts of text data. The training process
      can be extremely resource-intensive, requiring the deployment of
      thousands or even tens of thousands of GPUs in a single AI training
      cluster. Therefore, three-stage or even five-stage CLOS networks are
      commonly adopted for AI networks. The non-blocking nature of the
      network becomes increasingly critical for large-scale AI models.
      Therefore, adaptive routing is necessary to dynamically load balance
      traffic to the same destination over multiple ECMP paths, based on
      network capacity and even congestion information along those
      paths.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>Large language models (LLMs) like ChatGPT have become increasingly
      popular in recent years due to their impressive performance in various
      natural language processing tasks. These models, which often consist of
      billions or even trillions of parameters, are built by training deep
      neural networks on massive amounts of text data. The training process
      can be extremely resource-intensive, requiring the deployment of
      thousands or even tens of thousands of GPUs in a single AI training
      cluster. Therefore, three-stage or even five-stage CLOS networks are
      commonly adopted for AI networks. Furthermore, in rail-optimized CLOS
      topologies with standard GPU servers (a high-bandwidth (HB) domain of
      eight GPUs), the Nth GPU of each server in a group of servers is
      connected to the Nth leaf switch, which provides higher bandwidth and
      non-blocking connectivity between the GPUs in the same rail. In a
      rail-optimized topology, most traffic between GPU servers traverses
      the intra-rail networks rather than the inter-rail networks.</t>

      <t>The non-blocking nature of the network, especially the network for
      intra-rail communication, becomes increasingly critical for large-scale
      AI models. AI workloads tend to be extremely bandwidth-hungry, and they
      usually generate a few elephant flows simultaneously. If traditional
      hash-based ECMP load balancing is used without any optimization,
      serious congestion and high latency are highly likely once multiple
      elephant flows are routed to the same link. Since the job completion
      time depends on worst-case performance, serious congestion will make
      model training take longer than expected. Therefore, adaptive routing
      is necessary to dynamically load balance traffic to the same
      destination over multiple ECMP paths, based on network capacity and
      even congestion information along those paths. In other words,
      adaptive routing is a capacity-aware and even congestion-aware path
      selection algorithm.</t>

      <t>Furthermore, to minimize the risk of congestion, routing should be
      as fine-grained as possible. Flow-granular adaptive routing still has a
      certain statistical probability of congestion. Therefore,
      packet-granular adaptive routing is more desirable, although packet
      spraying causes out-of-order delivery. A flexible reordering mechanism
      must therefore be put in place (e.g., at egress ToRs or the receiving
      servers). Recent optimizations for RoCE and newly invented transport
      protocols as alternatives to RoCE no longer require handling
      out-of-order delivery at the network layer. Instead, the message
      processing layer is used to address it.</t>

      <t>To enable adaptive routing, whether flow-granular or
      packet-granular, it is necessary to propagate network topology
      information, including link capacity and/or available link capacity
      (i.e., link capacity minus link load), across the CLOS network. It
      therefore seems straightforward to use a link-state protocol such as
      OSPF or IS-IS, instead of BGP, as the underlay routing protocol in the
      CLOS network for propagating link capacity and/or available link
      capacity information. How to leverage OSPF or IS-IS to achieve adaptive
      routing is described in <xref target="I-D.xu-lsr-fare"/>. However, some
      data center network operators are accustomed to using BGP as the
      underlay routing protocol of data center networks <xref
      target="RFC7938"/>. Therefore, there is a need to leverage BGP to
      achieve adaptive routing as well.</t>

      <t><xref target="I-D.ietf-idr-link-bandwidth"/> specifies a way to
      perform weighted ECMP based on link bandwidths conveyed in the
      non-transitive link bandwidth extended community. However, it is
      impractical to enable adaptive routing by directly using the
      non-transitive link bandwidth extended community, due to the following
      constraints stated in <xref
      target="I-D.ietf-idr-link-bandwidth"/>:</t>

      <t>"No more than one link bandwidth extended community SHALL be attached
      to a route. Additionally, if a route is received with link bandwidth
      extended community and the BGP speaker sets itself as next-hop while
      announcing that route to other peers, the link bandwidth extended
      community should be removed. The extended community is optional
      non-transitive." </t>

      <t>Hence, this document defines a new extended community referred to as
      Path Bandwidth Extended Community and describes how to use this newly
      defined path bandwidth extended community to achieve adaptive routing.
      </t>

      <t>Note that while adaptive routing, especially at the packet-granular
      level, can help reduce congestion between switches in the network and
      thereby achieve a non-blocking fabric, it does not address the incast
      congestion that is commonly experienced in many-to-one communication
      patterns at the last-hop switches connected to the receivers.
      Therefore, a congestion control mechanism is always necessary between
      the sending and receiving servers to mitigate such congestion.</t>
    </section>

    <section anchor="Abbreviations_Terminology" title="Terminology">
      <t>This memo makes use of the terms defined in <xref
      target="RFC4360"/>.</t>
    </section>

    <section title="Path Bandwidth Extended Community">
      <t>The Path Bandwidth Extended Community is used to indicate the minimum
      bandwidth of the path towards the destination. It is a new IPv4 Address
      Specific Extended Community that can be transitive or
      non-transitive.</t>

      <t>The value of the high-order octet of this extended type is either
      0x01 (transitive) or 0x41 (non-transitive). The low-order octet of this
      extended type is TBD.</t>

      <t>The Value field consists of two sub-fields: </t>

      <t><list>
          <t>Global Administrator sub-field: This sub-field contains the
          router ID of the advertising router that appends the path bandwidth
          extended community or updates the path bandwidth value of the
          existing path bandwidth extended community. </t>

          <t>Local Administrator sub-field: This sub-field contains the path
          bandwidth value in IEEE floating point format with units of
          Gigabytes per second (GB/s).</t>
        </list></t>
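
      <t>As an illustration only, the encoding above can be sketched in
      Python as follows. The sketch rests on assumptions that this document
      does not state: the Local Administrator sub-field is taken to be the
      2-octet field of an IPv4 Address Specific Extended Community, so the
      bandwidth is packed as an IEEE 754 half-precision value in network
      byte order, and the sub-type 0x00 is a hypothetical placeholder for the
      TBD low-order octet. The function names are illustrative as well.</t>

      <figure>
        <artwork><![CDATA[
```python
import ipaddress
import struct

TYPE_TRANSITIVE = 0x01      # IPv4 Address Specific, transitive
TYPE_NON_TRANSITIVE = 0x41  # IPv4 Address Specific, non-transitive
SUBTYPE_PLACEHOLDER = 0x00  # hypothetical: real sub-type is TBD


def encode_path_bandwidth(router_id, gbytes_per_sec,
                          transitive=True,
                          subtype=SUBTYPE_PLACEHOLDER):
    """Build the 8-octet community (assumed layout)."""
    if transitive:
        type_high = TYPE_TRANSITIVE
    else:
        type_high = TYPE_NON_TRANSITIVE
    # Global Administrator: 4-octet router ID.
    global_admin = ipaddress.IPv4Address(router_id).packed
    # Assumption: 2-octet IEEE 754 half precision, big-endian.
    local_admin = struct.pack("!e", gbytes_per_sec)
    return bytes([type_high, subtype]) + global_admin + local_admin


def decode_path_bandwidth(data):
    """Return (transitive, subtype, router_id, gbytes_per_sec)."""
    if len(data) != 8:
        raise ValueError("extended community must be 8 octets")
    type_high, subtype = data[0], data[1]
    if type_high not in (TYPE_TRANSITIVE, TYPE_NON_TRANSITIVE):
        raise ValueError("unexpected high-order type octet")
    router_id = str(ipaddress.IPv4Address(data[2:6]))
    (gbytes_per_sec,) = struct.unpack("!e", data[6:8])
    is_transitive = type_high == TYPE_TRANSITIVE
    return is_transitive, subtype, router_id, gbytes_per_sec
```
]]></artwork>
      </figure>

      <t>For example, encode_path_bandwidth("192.0.2.1", 50.0) yields an
      8-octet value whose first octet is 0x01, and decoding it returns the
      same router ID and bandwidth. Note that under the half-precision
      assumption not every bandwidth value is exactly representable.</t>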
    </section>

    <section title="Solution Description">

      <section title="Adaptive Routing in 3-stage CLOS">
        <t><figure>
            <artwork align="center"><![CDATA[      
   +----+ +----+ +----+ +----+  
   | S1 | | S2 | | S3 | | S4 |  (Spine)
   +----+ +----+ +----+ +----+             
             
   +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
   | L1 | | L2 | | L3 | | L4 | | L5 | | L6 | | L7 | | L8 |  (Leaf)
   +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ 


                              Figure 1]]></artwork>
          </figure></t>

        <t>(Note that the diagram above does not include the connections
        between nodes. However, it can be assumed that leaf nodes are
        connected to every spine node in their CLOS topology.)</t>

        <t>In a three-stage CLOS network as shown in Figure 1, also known as a
        leaf-spine network, each leaf node would establish eBGP sessions with
        all spine nodes. </t>

        <t>All nodes are enabled for adaptive routing.</t>

        <t>When a leaf node, such as L1, advertises the route to a specific IP
        prefix that it originates, it will attach a transitive path bandwidth
        extended community filled with a maximum bandwidth value. </t>

        <t>Upon receiving the above advertisement, a spine node, such as S1,
        SHOULD determine the minimum value between the bandwidth of the link
        towards the advertising node (e.g., L1) and the value of the path
        bandwidth extended community carried in the received route, and then
        update the path bandwidth extended community with that minimum value
        before readvertising the route to remote eBGP peers. When S1
        receives multiple equal-cost routes for a given prefix from multiple
        leaf nodes (e.g., L1 and L2 in the server multi-homing scenario), for
        each route, it SHOULD determine the minimum value between the
        bandwidth of the link towards the advertising node and the value of
        the path bandwidth extended community carried in the received route,
        and then use that minimum bandwidth value as the weight of that
        route when performing weighted ECMP. When S1 readvertises the route
        for that prefix to remote eBGP peers, the path bandwidth extended
        community is updated with the sum of the per-route minimum bandwidth
        values.</t>
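
        <t>The spine-node behavior described above can be sketched as
        follows. This is a minimal illustration and not a normative
        algorithm: the function names are hypothetical, bandwidths are plain
        numbers in GB/s, and the "maximum bandwidth value" set by the
        originating leaf is represented by float("inf") purely as an
        assumption.</t>

        <figure>
          <artwork><![CDATA[
```python
def route_weight(link_bw, path_bw):
    """Weight of one route: the smaller of the bandwidth of the
    link towards the advertising node and the received path
    bandwidth value."""
    return min(link_bw, path_bw)


def spine_readvertise(received):
    """received: list of (link_bw, path_bw) pairs, one per
    equal-cost route for the same prefix.

    Returns (weights, readvertised_path_bw). With a single
    route the readvertised value is simply that route's
    minimum; with several it is the sum of the minimums."""
    weights = [route_weight(link_bw, path_bw)
               for link_bw, path_bw in received]
    return weights, sum(weights)
```
]]></artwork>
        </figure>

        <t>For instance, if S1 learns the prefix from L1 over a 50 GB/s link
        and from L2 over a 25 GB/s link, and both leaves advertised the
        maximum value, this sketch gives weights of 50 and 25 and a
        readvertised path bandwidth of 75 GB/s.</t>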

        <t>When a leaf node, such as L8, receives multiple equal-cost routes
        for that prefix from spine nodes (e.g., S1, S2, S3 and S4), for each
        route, it will determine the minimum value between the bandwidth of
        the link towards the advertising node and the value of the path
        bandwidth extended community carried in the received route, and then
        use that minimum bandwidth value as a weight value for that route when
        performing weighted ECMP.</t>

        <t>Note that the weighted ECMP according to path bandwidth SHOULD NOT
        be performed unless all equal-cost routes for a given prefix carry the
        path bandwidth extended community.</t>
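
        <t>A minimal sketch of the weight computation at a receiving node,
        including the rule above, is given below. Two points are
        assumptions rather than part of this document: falling back to
        equal weights (plain ECMP) when any route lacks the community, and
        the hash-based mapping of a flow key onto next hops. All names are
        hypothetical.</t>

        <figure>
          <artwork><![CDATA[
```python
import hashlib


def ecmp_weights(routes):
    """routes: list of (next_hop, link_bw, path_bw) tuples,
    where path_bw is None if the route carries no path
    bandwidth extended community.

    Returns {next_hop: weight}. Weighted ECMP is used only if
    every route carries the community; otherwise all next hops
    get equal weight (assumed fallback)."""
    if any(path_bw is None for _, _, path_bw in routes):
        return {nh: 1.0 for nh, _, _ in routes}
    return {nh: min(link_bw, path_bw)
            for nh, link_bw, path_bw in routes}


def pick_next_hop(flow_key, weights):
    """Map a flow key onto a next hop in proportion to the
    weights. Expects a non-empty weights dict."""
    total = sum(weights.values())
    digest = hashlib.sha256(flow_key.encode()).digest()
    # Scale the first 64 bits of the hash into [0, total).
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    running = 0.0
    for next_hop, weight in sorted(weights.items()):
        running += weight
        if point < running:
            return next_hop
    return next_hop  # guards against floating-point rounding
```
]]></artwork>
        </figure>

        <t>Packet-granular spraying could reuse the same weights with a
        per-packet key instead of a per-flow key; that choice is outside
        the scope of this sketch.</t>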
      </section>

      <section title="Adaptive Routing in 5-stage CLOS">
        <t><figure>
            <artwork align="center"><![CDATA[      
   =========================================         
   # +----+ +----+ +----+ +----+           #
   # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
   # +----+ +----+ +----+ +----+           #
   #                                PoD-1  #
   # +----+ +----+ +----+ +----+           #
   # | S1 | | S2 | | S3 | | S4 | (Spine)   #
   # +----+ +----+ +----+ +----+           #
   =========================================

   ===============================     ===============================
   # +----+ +----+ +----+ +----+ #     # +----+ +----+ +----+ +----+ #
   # |SS1 | |SS2 | |SS3 | |SS4 | #     # |SS1 | |SS2 | |SS3 | |SS4 | #
   # +----+ +----+ +----+ +----+ #     # +----+ +----+ +----+ +----+ #
   #   (Super-Spine@Plane-1)     #     #   (Super-Spine@Plane-4)     #
   =============================== ... ===============================

   =========================================         
   # +----+ +----+ +----+ +----+           #
   # | S1 | | S2 | | S3 | | S4 | (Spine)   #
   # +----+ +----+ +----+ +----+           #
   #                                PoD-8  #
   # +----+ +----+ +----+ +----+           #
   # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
   # +----+ +----+ +----+ +----+           #
   =========================================           

                              Figure 2]]></artwork>
          </figure>(Note that the diagram above does not include the
        connections between nodes. However, it can be assumed that the leaf
        nodes in a given PoD are connected to every spine node in that PoD.
        Similarly, each spine node (e.g., S1) is connected to all super-spine
        nodes in the corresponding PoD-interconnect plane (e.g.,
        Plane-1).)</t>

        <t>For a five-stage CLOS network as illustrated in Figure 2, each leaf
        node would establish eBGP sessions with all spine nodes of the same
        PoD while each spine node would establish eBGP sessions with all
        super-spine nodes in the corresponding PoD-interconnect plane.</t>

        <t>In a rail-optimized topology, intra-rail communication with high
        bandwidth requirements is restricted to a single PoD. Inter-rail
        communication with relatively lower bandwidth requirements needs to
        travel across PoDs through PoD-interconnect planes. Therefore,
        enabling adaptive routing only in PoD networks is sufficient. It is
        optional to perform adaptive routing for cross-PoD traffic.</t>

        <t>When a leaf node, such as L1 in PoD-1, advertises the route for a
        specific IP prefix that it originates, it will attach a transitive
        path bandwidth extended community filled with a maximum bandwidth
        value. </t>

        <t>Upon receiving the above route advertisement, a spine node, such as
        S1 in PoD-1, will determine the minimum value between the bandwidth of
        the link towards the advertising node (e.g., L1 in PoD-1) and the
        value of the path bandwidth extended community carried in the route,
        and then update the path bandwidth extended community with the above
        minimum value before readvertising that route to remote eBGP peers.
        Once S1 in PoD-1 receives multiple equal-cost routes for a given
        prefix from multiple leaf nodes (e.g., L1 and L2 in PoD-1 in the
        server multi-homing scenario), for each route, it will determine the
        minimum value between the bandwidth of the link towards the
        advertising node and the bandwidth value of the path bandwidth
        extended community carried in the route, and then use that minimum
        bandwidth value as a weight value for that route when performing
        weighted ECMP. When S1 readvertises the route for that prefix to
        remote eBGP peers, the path bandwidth extended community is updated
        with the sum of the per-route minimum bandwidth values.</t>

        <t>When a given super-spine node, such as SS1 in Plane-1, receives the
        route for that prefix from S1 in PoD-1, it will not update the
        transitive path bandwidth extended community when readvertising that
        route. It MAY attach another path bandwidth extended community, which
        is non-transitive, to indicate the bandwidth of the link towards the
        advertising router.</t>

        <t>When a given spine node in another PoD, such as S1 in PoD-8,
        receives multiple equal-cost routes for a given prefix from
        super-spine nodes in Plane-1 (e.g., SS1, SS2, SS3 and SS4 in Plane-1),
        it will not update the value of the transitive path bandwidth extended
        community when readvertising that route towards remote peers (note
        that the transitive path bandwidth extended communities of those
        equal-cost routes all carry the same value, which was set by S1 in
        PoD-1). Meanwhile, if every route contains a non-transitive path
        bandwidth extended community, then for each route it will determine
        the minimum value between the bandwidth of the link towards the
        advertising node and the bandwidth value of the non-transitive path
        bandwidth extended community carried in the route, and then use that
        minimum bandwidth value as the weight of that route when
        performing weighted ECMP.</t>
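
        <t>The cross-PoD handling described in the two preceding paragraphs
        can be sketched as follows. This is an illustration under stated
        assumptions, not a definitive procedure: the function names are
        hypothetical, bandwidths are plain numbers in GB/s, and returning no
        weights (i.e., plain ECMP) when a route lacks the non-transitive
        community is an assumed fallback.</t>

        <figure>
          <artwork><![CDATA[
```python
def super_spine_readvertise(transitive_bw, link_bw_to_sender,
                            attach_non_transitive=True):
    """Super-spine: keep the transitive value untouched and
    optionally attach a non-transitive community carrying the
    bandwidth of the link towards the advertising router.

    Returns (transitive_bw, non_transitive_bw or None)."""
    if attach_non_transitive:
        return transitive_bw, link_bw_to_sender
    return transitive_bw, None


def remote_spine_process(routes):
    """Spine in another PoD. routes: non-empty list of
    (link_bw, transitive_bw, non_transitive_bw) tuples, one per
    equal-cost route received from the super-spine nodes.

    Returns (weights or None, transitive_bw). The transitive
    value is readvertised unchanged; weights are computed only
    if every route has a non-transitive community."""
    # All routes carry the same transitive value, as set by the
    # spine node in the originating PoD.
    transitive_bw = routes[0][1]
    if any(nt is None for _, _, nt in routes):
        return None, transitive_bw  # assumed: plain ECMP
    weights = [min(link_bw, nt) for link_bw, _, nt in routes]
    return weights, transitive_bw
```
]]></artwork>
        </figure>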

        <t>When a leaf node, such as L8 in PoD-8, receives multiple equal-cost
        routes for that prefix from multiple spine nodes (e.g., S1, S2, S3 and
        S4 in PoD-8), for each route, it will determine the minimum value
        between the bandwidth of the link towards the advertising node and the
        value of the path bandwidth extended community carried in the route,
        and then use that minimum bandwidth value as a weight value for that
        route when performing weighted ECMP.</t>

        <t>Note that the weighted ECMP according to path bandwidth SHOULD NOT
        be performed unless all equal-cost routes for a given prefix carry the
        path bandwidth extended community.</t>
      </section>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>TBD.</t>

      <!---->
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>

      <!---->
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='reference.RFC.2119'?>

      <?rfc include='reference.RFC.4360'?>

      <!---->
    </references>

    <references title="Informative References">
      <?rfc include='reference.RFC.7938'?>

      <?rfc include="reference.I-D.xu-lsr-fare"?>

      <?rfc include="reference.I-D.ietf-idr-link-bandwidth"?>

      <!---->
    </references>
  </back>
</rfc>
