<?xml version="1.0" encoding="utf-8"?>
<!-- 
     draft-rfcxml-general-template-standard-00
  
     This template includes examples of the most commonly used features of RFCXML with comments 
     explaining how to customise them. This template can be quickly turned into an I-D by editing 
     the examples provided. Look for [REPLACE], [REPLACE/DELETE], [CHECK] and edit accordingly.
     Note - 'DELETE' means delete the element or attribute, not just the contents.
     
     Documentation is at https://authors.ietf.org/en/templates-and-schemas
-->
<?xml-model href="rfc7991bis.rnc"?>  <!-- Required for schema validation and schema-aware editing -->
<!-- <?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?> -->
<!-- This third-party XSLT can be enabled for direct transformations in XML processors, including most browsers -->


<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<!-- If further character entities are required then they should be added to the DOCTYPE above.
     Use of an external entity file is not recommended. -->

<rfc
  xmlns:xi="http://www.w3.org/2001/XInclude"
  category="info"
  docName="draft-hu-rtgwg-cbfc-rsvp-00"
  ipr="trust200902"
  obsoletes=""
  updates=""
  submissionType="IETF"
  xml:lang="en"
  version="3">
<!-- [REPLACE] 
       * docName with name of your draft
     [CHECK] 
       * category should be one of std, bcp, info, exp, historic
       * ipr should be one of trust200902, noModificationTrust200902, noDerivativesTrust200902, pre5378Trust200902
       * updates can be an RFC number as NNNN
       * obsoletes can be an RFC number as NNNN 
-->

  <front>
    <title abbrev="CBFC for Cross-AIDC WAN Based on RSVP">Credit-based Flow Control for Cross-AIDC WAN
      Transmission Based on RSVP
    </title>
    <!--  [REPLACE/DELETE] abbrev. The abbreviated title is required if the full title is longer than 39 characters -->

    <seriesInfo name="Internet-Draft" value="draft-hu-rtgwg-cbfc-rsvp-00"/>
   
    <author fullname="Jiayuan Hu" initials="J." role="editor" surname="Hu">
      <!-- [CHECK]
             * initials should not include an initial for the surname
             * role="editor" is optional -->
    <!-- Can have more than one author -->
      
    <!-- all of the following elements are optional -->
      <organization>China Telecom</organization>
      <address>
        <postal>
          <!-- Reorder these if your country does things differently -->
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangdong</region>
          <code>510000</code>
          <country>CN</country>
          <!-- Uses two letter country code -->
        </postal>
        <email>hujy5@chinatelecom.cn</email>
        <!-- Can have more than one <email> element -->
      </address>
    </author>
   
    <date year="2025"/>
    <!-- On draft submission:
         * If only the current year is specified, the current day and month will be used.
         * If the month and year are both specified and are the current ones, the current day will
           be used
         * If the year is not the current one, it is necessary to specify at least a month and day="1" will be used.
    -->

    <area>Routing</area>
    <workgroup>Routing Area Working Group</workgroup>
    <!-- "Internet Engineering Task Force" is fine for individual submissions.  If this element is 
          not present, the default is "Network Working Group", which is used by the RFC Editor as 
          a nod to the history of the RFC Series. -->

    <keyword>credit-based flow control</keyword>
    <keyword>RSVP</keyword>
    <keyword>RoCEv2</keyword>
    <!-- [REPLACE/DELETE]. Multiple allowed.  Keywords are incorporated into HTML output files for 
         use by search engines. -->

    <abstract>
      <t>This document defines a credit-based flow control mechanism for WANs
        based on RSVP. As demand for AI computing power grows, the capacity of
        a single AIDC can no longer meet the needs of large model training.
        This has given rise to cross-AIDC distributed model training, which in
        turn requires RoCEv2 packets to be carried over WANs. AI training is
        extremely sensitive to packet loss: even a small amount of loss can
        significantly reduce training efficiency. Elephant flows and highly
        concurrent traffic place further demands on network performance.
        Credit-based flow control is a backpressure-based traffic management
        technique that has proven reliable and stable in practical
        deployments. It can provide high-throughput, zero-packet-loss
        transmission for RoCEv2 traffic and thereby preserve the efficiency
        of cross-AIDC AI training.</t>
      <t>This document focuses on the scenario in which RoCEv2 packets are carried
        through SRv6 tunnels in the WAN, and extends RSVP for use in that
        environment. It introduces credit-based flow control into RSVP to
        achieve precise traffic control and describes the associated
        processing.</t>
    </abstract>
 
  </front>

  <middle>
    
    <section>
      <name>Introduction</name>
      <t>
        The rapid growth in demand for AI computing power, especially for large-scale
        model training, is reshaping the data center landscape. A single AIDC can no
        longer meet the needs of large-scale model training. The parameter counts of
        large models have grown sharply over the past five years, reaching the trillion
        level, and are expected to grow a hundredfold over the next five years,
        approaching the quadrillion level. This has led to the rise of cross-AIDC
        distributed model training, which in turn drives the need to carry RoCEv2
        packets across WANs.
      </t>
      <t>
        AI training is highly sensitive to packet loss: even a small amount of loss can
        significantly reduce training efficiency. In addition, elephant flows and highly
        concurrent traffic pose further challenges to network performance. To address
        these issues, this document specifies a credit-based flow control mechanism for
        WAN transmission.
      </t>
      <t>
        Credit-based flow control is a backpressure-based traffic management technique. It
        has proven reliable and stable in practical deployments and can provide
        high-throughput, zero-packet-loss transmission for RoCEv2 traffic, which preserves
        the efficiency of cross-AIDC AI training.
      </t>
      <t>
        This document addresses the scenario in which RoCEv2 packets are carried through SRv6
        tunnels in the WAN, and extends RSVP for use in WAN environments. Although RSVP was
        designed to reserve resources for data flows, it has limitations: it scales poorly as
        the number of reservations grows, it assumes a largely static network topology, and it
        can waste resources or interrupt service when nodes fail. By introducing credit-based
        flow control into RSVP, this document enables precise traffic control. The proposed
        solution defines a new RSVP option to implement credit-based flow control, which is
        described in later sections in three parts: the initial process, the data
        transmission process, and the termination process.
      </t>
    </section>
      
    <section>
      <name>Conventions Used in This Document</name>
      <section>
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
          "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT
          RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
          interpreted as described in BCP 14 <xref target="RFC2119"/>
          <xref target="RFC8174"/> when, and only when, they appear in
          all capitals, as shown here.</t>
      </section>

      <section>
        <name>Abbreviations</name>
        <t>AIDC: Artificial Intelligence Data Center</t>
        <t>MB: Megabyte</t>
        <t>QoS: Quality of Service</t>
        <t>RoCEv2: RDMA over Converged Ethernet version 2</t>
        <t>RSVP: Resource Reservation Protocol</t>
        <t>SRv6: Segment Routing over IPv6</t>
        <t>WAN: Wide Area Network</t>
      </section>
    </section>
      <!-- [CHECK] The 'Requirements Language' section is optional -->
    
    <section>
      <name>Scenarios for distributed AI training networks</name>
      <section>
        <name>Distributed model training across different AIDCs</name>
        <t>
          The development of large AI models is accelerating. Major technology companies are
          developing large models and releasing new versions as quickly as possible to gain
          an early lead in this new industry. The number of parameters in large models has
          grown roughly a hundredfold over the past five years and has reached the trillion
          level. It is expected to grow a hundredfold again over the next five years,
          approaching the quadrillion level. Computing capacity has been scaled up
          accordingly: a single data center now hosts clusters on the order of ten thousand
          GPUs in an attempt to keep up with seemingly unbounded AI computing demand.
        </t>
        <t>
          An AI cluster connects multiple compute nodes into a collaborative computing
          environment, providing the computing power and data processing capability that AI
          applications require. A single AI cluster cannot grow without limit, however: it is
          constrained by factors such as power supply and the physical size of the site. In
          addition, cloud computing workloads exhibit peaks and valleys, and the computing
          capacity of a single cluster tends to be deployed in fragments. This makes it
          difficult to host large-scale AI training jobs and lowers resource utilization.
        </t>
        <t>
          To address the fragmentation of AI resources in the cloud, Microsoft has proposed the
          "Singularity" framework, which enables planet-scale, preemptible, migratable, and
          elastically scalable scheduling of AI tasks. The framework makes resource scheduling
          highly elastic and migratable and raises the utilization of AI resources in the cloud,
          but it pays little attention to training performance across clusters. To address
          heterogeneous AI training networks in public clouds, AWS has proposed the MiCS solution,
          which makes full use of heterogeneous network bandwidth: by reducing the traffic on
          slower links, it amortizes the expensive global gradient synchronization overhead.
          To reduce the high cost of building AI training clusters, Meta has proposed
          decentralized heterogeneous training, which uses distributed, heterogeneous AI training
          resources interconnected over low-bandwidth links to train foundation models at lower
          cost. Model training across AIDCs is therefore an important stage in the development
          of artificial intelligence, and it requires corresponding network capabilities.
        </t>
      </section>
      <section>
        <name>Separated storage and model training</name>
        <t>In traditional data processing architectures, computing and storage are tightly
          coupled. This integrated architecture served early data requirements well, but as
          business complexity and data volume have surged, two defects have become hard to
          overcome. First, when a new computing node is added, data (both metadata and data)
          must be synchronized between nodes, which lowers expansion efficiency and increases
          system complexity and management cost. Second, because computing and storage
          resources are tightly coupled, they cannot be allocated or adjusted dynamically as
          the business load changes, which results in low resource utilization and makes cost
          hard to control.
        </t>
        <t>
          Separating storage from computing removes this coupling, but it means that training
          data must cross the network. In this scenario, even minimal packet loss can seriously
          disrupt training progress. Efficient, zero-packet-loss transmission of storage data
          is therefore essential, which places higher demands on the network.
        </t>
      </section>
    </section>

    <section>
      <name>Forwarding Plane Solution</name>
      <section>
        <name>RSVP protocol extensions</name>
        <t>
          RSVP <xref target="RFC2205"/> is a signaling protocol that enables end systems and
          network devices to reserve resources along a data path. It allows applications
          to request a specific QoS, such as bandwidth, delay, and packet loss rate, for
          their data flows. However, RSVP has scalability issues as the number of
          reservations increases. In addition, RSVP assumes a relatively static network
          topology and may not adapt well to highly dynamic networks. The RSVP common
          header defined in <xref target="RFC2205"/> is shown below.
        </t>
        <figure>
          <name>Common Header</name>
          <artwork align="center"><![CDATA[
                   0             1              2             3
         +-------------+-------------+-------------+-------------+
         | Vers | Flags|  Msg Type   |       RSVP Checksum       |
         +-------------+-------------+-------------+-------------+
         |  Send_TTL   | (Reserved)  |        RSVP Length        |
         +-------------+-------------+-------------+-------------+
          ]]>
          </artwork>
        </figure>
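        <t>
          As an illustration only, the common header above could be packed and parsed as in
          the following Python sketch. The function names and the default Send_TTL of 64 are
          assumptions made for this example, and the checksum is left at zero rather than
          computed.
        </t>
        <sourcecode type="python"><![CDATA[
```python
import struct

RSVP_VERSION = 1  # version number defined by RFC 2205
_HEADER = struct.Struct("!BBHBBH")  # network byte order, 8 bytes in total


def pack_common_header(msg_type, length, send_ttl=64, flags=0, checksum=0):
    """Return the 8-byte common header; the checksum is not computed here."""
    vers_flags = (RSVP_VERSION << 4) | (flags & 0x0F)
    return _HEADER.pack(vers_flags, msg_type, checksum, send_ttl, 0, length)


def unpack_common_header(data):
    """Split the first 8 bytes of an RSVP message into named fields."""
    vers_flags, msg_type, checksum, send_ttl, _reserved, length = \
        _HEADER.unpack_from(data)
    return {
        "version": vers_flags >> 4,
        "flags": vers_flags & 0x0F,
        "msg_type": msg_type,
        "checksum": checksum,
        "send_ttl": send_ttl,
        "length": length,
    }
```
]]></sourcecode>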
        <t>
          In addition, in the RSVP resource reservation process, resources that have been
          reserved can remain held even if the data flow is never successfully
          established, which wastes resources. RSVP also relies on specific nodes to manage
          and maintain reservations; if these nodes fail, the entire reservation process may
          be interrupted. To address these problems, namely the inability to adjust
          reservations dynamically and the service guarantee issues that failures can cause,
          this document defines a new RSVP option that implements credit-based flow control.
          The procedure is divided into three steps: the initial process, the data
          transmission process, and the termination process.
        </t>
        <section>
          <name>Initial process</name>
          <t>
            Before RoCEv2 flows are transmitted between servers in different AIDCs, a transmission
            channel with service guarantees needs to be established between the servers.
            Traditional RSVP can only guarantee quality of service in a fixed manner, based on
            specific service information. Credit-based RSVP initializes in a similar way to
            traditional RSVP. The sender sends a "Credit-based PATH" message toward the
            receiver. This message carries the data flow identifier and is flooded along every
            path that can reach the receiver. Each device on a path that receives the message
            sends a "Credit-based RESV" message carrying a credit value to its upstream device
            on that path. The credit value is stored in the object contents field of the RSVP
            message. It represents the buffer that the device has reserved for the forwarding
            task, and it should be less than the size of that buffer.
          </t>
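          <t>
            The outcome of this exchange on a single path could be sketched as follows. This is
            a simplified illustration, not part of the specification: the function name and the
            headroom_mb margin that keeps each advertised credit below the reserved buffer are
            assumptions of this example.
          </t>
          <sourcecode type="python"><![CDATA[
```python
def initial_credits(path, buffer_mb, headroom_mb=1):
    """Return the credit (in MB) that each device learns from its next hop.

    path        -- device names ordered from sender to receiver
    buffer_mb   -- buffer reserved for the flow by each downstream device
    headroom_mb -- hypothetical margin keeping the credit below the buffer
    """
    credits = {}
    for upstream, device in zip(path, path[1:]):
        credit = buffer_mb[device] - headroom_mb
        if credit <= 0:
            raise ValueError(f"{device} has no buffer to advertise")
        # The Credit-based RESV from `device` travels one hop upstream.
        credits[upstream] = credit
    return credits
```
]]></sourcecode>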
          <t>
            This document defines the following new values for the Msg Type field of the
            common header:
          </t>
          <t>28 = Credit-based PATH</t>
          <t>29 = Credit-based RESV</t>
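          <t>
            An implementation might represent these values next to the existing PATH and RESV
            types as in the sketch below. The enumeration and helper names are illustrative
            assumptions; only the numeric values come from this document and RFC 2205.
          </t>
          <sourcecode type="python"><![CDATA[
```python
from enum import IntEnum


class MsgType(IntEnum):
    PATH = 1           # defined in RFC 2205
    RESV = 2           # defined in RFC 2205
    CREDIT_PATH = 28   # Credit-based PATH, proposed in this document
    CREDIT_RESV = 29   # Credit-based RESV, proposed in this document


def is_credit_based(msg_type):
    """Tell whether a Msg Type value belongs to the credit-based extension."""
    return msg_type in (MsgType.CREDIT_PATH, MsgType.CREDIT_RESV)
```
]]></sourcecode>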
          <t>
            RSVP defines objects as the extension mechanism for its message types. Every
            object consists of one or more 32-bit words with a one-word header, in the
            following format:
          </t>
          <figure>
            <name>Object Formats</name>
            <artwork align="center"><![CDATA[
                 0             1              2             3
         +-------------+-------------+-------------+-------------+
         |       Length (bytes)      |  Class-Num  |   C-Type    |
         +-------------+-------------+-------------+-------------+
         |                                                       |
         //                  (Object contents)                   //
         |                                                       |
         +-------------+-------------+-------------+-------------+
          ]]>
            </artwork>
          </figure>
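          <t>
            A minimal sketch of encoding and walking objects in this format is given below. The
            function names are assumptions of this example. The Length field covers the one-word
            header as well as the contents, so it is always a multiple of 4 and at least 4.
          </t>
          <sourcecode type="python"><![CDATA[
```python
import struct


def pack_object(class_num, c_type, contents=b""):
    """Prefix the contents with the one-word object header."""
    if len(contents) % 4:
        raise ValueError("object contents must be a multiple of 4 bytes")
    return struct.pack("!HBB", 4 + len(contents), class_num, c_type) + contents


def iter_objects(payload):
    """Yield (class_num, c_type, contents) for each object in a message body."""
    offset = 0
    while offset < len(payload):
        if len(payload) - offset < 4:
            raise ValueError("truncated object header")
        length, class_num, c_type = struct.unpack_from("!HBB", payload, offset)
        if length < 4 or length % 4 or offset + length > len(payload):
            raise ValueError("malformed object length")
        yield class_num, c_type, payload[offset + 4:offset + length]
        offset += length
```
]]></sourcecode>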
          <t>
            The objects carried in the Credit-based PATH message are the same as those in the
            traditional PATH message. The Credit-based RESV message differs: it uses new
            Class-Num and C-Type values. The following new class is defined in Class-Num:
          </t>
          <t>
            Credit: the initial buffer that the device reserves for the forwarding task,
            in units of MB.
          </t>
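          <t>
            This document does not yet assign a Class-Num or C-Type value to the Credit class,
            nor does it fix the layout of its contents. Purely for illustration, the sketch below
            assumes placeholder values of 200 and 1 and a single 32-bit unsigned credit field in
            MB; none of these choices is normative.
          </t>
          <sourcecode type="python"><![CDATA[
```python
import struct

CREDIT_CLASS_NUM = 200  # hypothetical placeholder, not an assigned value
CREDIT_C_TYPE = 1       # hypothetical placeholder, not an assigned value
_CREDIT = struct.Struct("!HBBI")  # object header plus assumed 32-bit credit


def pack_credit_object(credit_mb):
    """Encode a Credit object carrying the reserved buffer size in MB."""
    return _CREDIT.pack(_CREDIT.size, CREDIT_CLASS_NUM, CREDIT_C_TYPE, credit_mb)


def unpack_credit_object(data):
    """Return the credit in MB, rejecting objects of any other class."""
    length, class_num, c_type, credit_mb = _CREDIT.unpack_from(data)
    expected = (_CREDIT.size, CREDIT_CLASS_NUM, CREDIT_C_TYPE)
    if (length, class_num, c_type) != expected:
        raise ValueError("not a Credit object")
    return credit_mb
```
]]></sourcecode>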
        </section>

        <section>
          <name>Data transmission process</name>
          <t>
            After the initial process is completed, each network device maintains a credit value,
            which is equal to the credit value returned by the next-hop device on the path during
            initialization. The sending server can then start sending data. A forwarding device
            on the path forwards only data packets whose size does not exceed the credit value it
            maintains. After forwarding a packet, it subtracts the size of that packet from its
            credit value. While the credit value is greater than 0, the device continues to send;
            when the credit value reaches 0, it stops sending.
          </t>
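          <t>
            The sending side of this rule could be sketched as follows. The class name is an
            assumption of this example, and credit and packet sizes are expressed in the same
            unit (bytes here) for simplicity.
          </t>
          <sourcecode type="python"><![CDATA[
```python
class CreditCounter:
    """Credit that a device holds toward its next hop."""

    def __init__(self, initial_credit):
        self.credit = initial_credit

    def try_send(self, packet_size):
        """Consume credit for one packet; return False if it must wait."""
        if packet_size > self.credit:
            return False  # not enough credit, hold the packet
        self.credit -= packet_size
        return True
```
]]></sourcecode>
          <t>
            With an initial credit of 3000, two 1500-byte packets can be sent back to back, after
            which the counter is 0 and further packets are held.
          </t>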
          <t>
            When the next-hop forwarding device receives a data packet, it places the packet in
            its buffer and forwards buffered data in first-in, first-out order. After forwarding
            a data packet, it replies to the previous-hop forwarding device with a Credit-based
            RESV message. This message carries a new credit value equal to the size of the data
            packet that has just been forwarded, indicating that the corresponding buffer space
            has become available.
          </t>
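          <t>
            The receiving side could be sketched as below. The class and method names are
            assumptions of this example. The value returned by forward_one() is the credit that
            the Credit-based RESV reply would carry; this sketch assumes that the previous hop
            adds it to the credit value it maintains.
          </t>
          <sourcecode type="python"><![CDATA[
```python
from collections import deque


class DownstreamBuffer:
    """FIFO buffer that a device reserves for one flow."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.queue = deque()  # sizes of buffered packets, oldest first

    def receive(self, packet_size):
        """Store an arriving packet; overflow means credit was exceeded."""
        if self.used + packet_size > self.capacity:
            raise OverflowError("previous hop sent more than its credit")
        self.queue.append(packet_size)
        self.used += packet_size

    def forward_one(self):
        """Forward the oldest packet and return the credit to send upstream."""
        if not self.queue:
            return 0
        size = self.queue.popleft()
        self.used -= size
        return size
```
]]></sourcecode>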
          <t>
            After each data transmission, the device expects to receive a message carrying a
            credit value from the next-hop device. If the link or the next-hop device fails,
            this reply may never arrive. If no reply has been received when the preset heartbeat
            time elapses, the device considers the next-hop device or the link to have failed.
            It then forwards the traffic over a backup path, selected according to the reply
            messages received during the initial process, so that the traffic is still
            delivered without loss.
          </t>
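          <t>
            One possible way to track the heartbeat time is sketched below. The class, its method
            names, and the use of caller-supplied timestamps are assumptions of this example;
            this document does not define the heartbeat value or how the backup path is chosen.
          </t>
          <sourcecode type="python"><![CDATA[
```python
class NextHopMonitor:
    """Switch to a backup next hop when credit replies stop arriving."""

    def __init__(self, primary, backup, heartbeat):
        self.active = primary
        self.backup = backup
        self.heartbeat = heartbeat   # preset heartbeat time, in seconds
        self.waiting_since = None    # time of the oldest unanswered send

    def on_send(self, now):
        """Record that a packet was sent and a credit reply is expected."""
        if self.waiting_since is None:
            self.waiting_since = now

    def on_credit_reply(self):
        """A Credit-based RESV arrived, so the next hop is alive."""
        self.waiting_since = None

    def next_hop(self, now):
        """Return the next hop to use, failing over if the reply is overdue."""
        overdue = (self.waiting_since is not None
                   and now - self.waiting_since > self.heartbeat)
        if overdue and self.active != self.backup:
            self.active = self.backup
            self.waiting_since = None
        return self.active
```
]]></sourcecode>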
        </section>

        <section>
          <name>Termination process</name>
          <t>
            After the data transmission is completed, the source device sends an RSVP message
            indicating the end of the task, which is flooded along all reachable paths. On
            receiving this message, the devices on those paths terminate the corresponding
            guarantee tasks.
          </t>
          <t>TBC</t>
        </section>
      </section>
    </section>
    
    <section anchor="IANA">
    <!-- All drafts are required to have an IANA considerations section. See RFC 8126 for a guide.-->
      <name>IANA Considerations</name>
      <t>TBC</t>
    </section>
    
    <section anchor="Security">
      <!-- All drafts are required to have a security considerations section. See RFC 3552 for a guide. -->
      <name>Security Considerations</name>
      <t>TBC</t>
    </section>
    
    <!-- NOTE: The Acknowledgements and Contributors sections are at the end of this template -->
  </middle>

  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2205.xml"/>
        <!-- The recommended and simplest way to include a well known reference -->
        
      </references>
    </references>
    
    <section anchor="Contributors" numbered="false">
      <!-- [REPLACE/DELETE] a Contributors section is optional -->
      <name>Contributors</name>
      <t>Thanks to all of the contributors.</t>
      <!-- [CHECK] it is optional to add a <contact> record for some or all contributors -->
    </section>
    
 </back>
</rfc>
