<?xml version="1.0" encoding="iso-8859-1" ?>
<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>

<rfc category="std" ipr="trust200902" docName="draft-xiao-rtgwg-rocev2-fast-cnp-01" consensus="true" submissionType="IETF">

<front>
        <title abbrev="Fast CNP in RoCEv2 Networks"> Fast Congestion Notification Packet (CNP) in RoCEv2 Networks </title>
 
  <author fullname="Xiao Min" initials="X" surname="Min">
      <organization>ZTE Corp.</organization>
     <address>
       <postal>
         <street/>

         <!-- Reorder these if your country does things differently -->

         <city>Nanjing</city>

         <region/>

         <code/>

         <country>China</country>
       </postal>

       <phone>+86 18061680168</phone>

       <email>xiao.min2@zte.com.cn</email>

       <!-- uri and facsimile elements may also be added -->
     </address>
    </author>
	
  <author fullname="Hesong Li" initials="H" surname="Li">
      <organization>ZTE Corp.</organization>
     <address>
       <postal>
         <street/>

         <!-- Reorder these if your country does things differently -->

         <city>Wuhan</city>

         <region/>

         <code/>

         <country>China</country>
       </postal>

       <phone/>

       <email>li.hesong@zte.com.cn</email>

       <!-- uri and facsimile elements may also be added -->
     </address>
    </author>
	
  <author initials="L." surname="Iannone" fullname="Luigi Iannone">
      <organization abbrev="Huawei">Huawei Technologies France S.A.S.U.</organization>
      <address>
        <postal>
          <street>18, Quai du Point du Jour</street>
          <city>Boulogne-Billancourt</city>
          <code>92100</code>
          <country>France</country>
        </postal>
        <email>luigi.iannone@huawei.com</email>
      </address>
    </author>

    <date year="2024"/>
  
    <area>Routing</area>
    <workgroup>RTGWG Working Group</workgroup>

    <keyword>Request for Comments</keyword>
    <keyword>RFC</keyword>
    <keyword>Internet Draft</keyword>
    <keyword>I-D</keyword>

    <abstract>
  <t> This document describes Fast Congestion Notification Packet (Fast CNP), a congestion control mechanism for Remote Direct Memory 
  Access (RDMA) over Converged Ethernet version 2 (RoCEv2) networks that is inspired by Really Explicit Congestion Notification (RECN) 
  described in RFC 7514. By extending the RoCEv2 Congestion Notification Packet (CNP), Fast CNP can be sent by the switches directly to 
  the sender, advising the sender to reduce the rate at which it sends the flow of RoCEv2 data traffic. </t>
    </abstract>
    
</front>
  
<middle>

  <section title="Introduction">
  
  <t> Remote Direct Memory Access (RDMA) is a method of accessing memory on a remote system without interrupting the processing of the 
  Central Processing Unit (CPU) on that system. RDMA enables lower latency and higher throughput on the network and lower CPU utilization 
  for the servers and storage systems. High Performance Computing (HPC) and Artificial Intelligence (AI) applications can be accelerated 
  by RDMA. </t>
  
  <t> InfiniBand is a lossless network optimized for HPC and AI. It typically supports RDMA, enabling machines to communicate and 
  share data without interrupting the host CPU. </t>
  
  <t> RDMA over Converged Ethernet (RoCE) is an open standard enabling RDMA and network offloads over an Ethernet network. The 
  current and most popular implementation is RDMA over Converged Ethernet version 2 (RoCEv2) <xref target="IBTA-Spec"/>. RoCEv2 
  runs the InfiniBand transport layer over UDP and IP protocols on an Ethernet network, bringing many of the advantages of InfiniBand 
  to Ethernet networks. </t>
  
  <t> RoCEv2 networks often implement a proactive congestion control mechanism analogous to Explicit Congestion Notification 
  (ECN) <xref target="RFC3168"/>, in which the switches mark packets if congestion occurs in the network. The marked packets alert the 
  receiver that congestion is imminent, and the receiver alerts the sender with a Congestion Notification Packet (CNP). After receiving 
  the CNP, the sender knows to back off, slowing down the transmission rate temporarily until the flow path is ready to handle a higher 
  rate of traffic. </t>
  
  <t> This document describes a RoCEv2 congestion control mechanism called Fast CNP, which is inspired by Really Explicit Congestion 
  Notification (RECN) <xref target="RFC7514"/>. By extending the RoCEv2 CNP, Fast CNP can be sent by the switches directly to the sender, 
  advising the sender to reduce the rate at which it sends the flow of RoCEv2 data traffic. As its name indicates, the primary benefit of 
  Fast CNP is that it reaches the sender faster than the receiver-originated CNP, because it is generated by the switch at which 
  congestion occurs rather than by the receiver. </t>
  
  </section>
  
  <section title="Conventions Used in This Document">
   
    <section title="Abbreviations">
      <t> AI: Artificial Intelligence</t>
      <t> CNP: Congestion Notification Packet</t>
      <t> CPU: Central Processing Unit</t>
      <t> DoS: Denial-of-Service</t>
      <t> ECN: Explicit Congestion Notification</t>
      <t> ECMP: Equal-Cost Multipath</t>
      <t> HPC: High Performance Computing</t>
      <t> HPCC++: Enhanced High Precision Congestion Control</t>
      <t> IBTA: InfiniBand Trade Association</t>
      <t> IOAM: In situ Operations, Administration, and Maintenance</t>
      <t> RDMA: Remote Direct Memory Access</t>
      <t> RECN: Really Explicit Congestion Notification</t>
      <t> RoCE: RDMA over Converged Ethernet</t>
      <t> RoCEv2: RDMA over Converged Ethernet version 2</t>
    </section>
  
    <section title="Requirements Language">  
      <t> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", 
      and "OPTIONAL" in this document are to be interpreted as described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, 
      and only when, they appear in all capitals, as shown here.</t>
    </section>
  
  </section>
  
  <section title="RoCEv2 Data Packet and CNP Formats">

  <t> RoCEv2 packets use the well-known UDP Destination Port number 4791, which unambiguously distinguishes them in a stateless manner. 
  The RoCEv2 data packet format and the RoCEv2 Congestion Notification Packet (CNP) format are shown in Figure 1 and Figure 2, respectively.</t>
  
  <figure anchor="Figure_1" title="RoCEv2 Data Packet Format">
  <artwork align="left"> <![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                        Ethernet Header                        ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                          IPv6 Header                          ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                           UDP Header                          ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                 InfiniBand Transport Header(s)                |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                            Payload                            ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Invariant CRC                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]>  </artwork>
  </figure>
  
  <t> In a RoCEv2 data packet, the InfiniBand Transport Header(s) must start with an InfiniBand Base Transport Header, followed by zero, one, or 
  more InfiniBand Extended Transport Header(s). </t>
  
  <t> Within the InfiniBand Base Transport Header, there is a 24-bit field called Destination Queue Pair (QP), indicating the Work Queue Pair 
  Number at the destination. The QP is the virtual interface that the hardware provides to an InfiniBand architecture consumer, and it serves 
  as a virtual communication port for the consumer. The operation on each QP is independent of the others. </t>
  
  <t> Note that, to save space, the Source QP, which indicates the Work Queue Pair Number at the source, is not included in the InfiniBand Base 
  Transport Header. It is assumed that the receiver is able to determine the Source QP of a RoCEv2 data packet, because both the sender and the 
  receiver of a RoCEv2 data packet know the mapping between the Source QP and the Destination QP. </t>
  
  <figure anchor="Figure_2" title="RoCEv2 Congestion Notification Packet Format">
  <artwork align="left"> <![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                        Ethernet Header                        ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                          IPv6 Header                          ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                           UDP Header                          ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                InfiniBand Base Transport Header               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                            Reserved                           |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Invariant CRC                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]>  </artwork>
  </figure>
  
  <t> In a RoCEv2 congestion notification packet, the InfiniBand Base Transport Header is the only InfiniBand Transport Header present, 
  and it follows the IP/UDP headers. This document considers only IPv6; IPv4 is outside its scope. </t>
  
  <t> The RoCEv2 CNP is generated by the receiver when the receiver receives a RoCEv2 data packet with the ECN bits set. The Destination QP field 
  within the InfiniBand Base Transport Header of the CNP is set to the Work Queue Pair Number at the sender that corresponds to the Destination QP 
  of the RoCEv2 data packet received by the receiver. </t>

  <t> After the sender receives the RoCEv2 CNP, the sender reduces the rate at which it sends RoCEv2 data packets on the Work Queue Pair identified 
  by the Destination QP of the RoCEv2 CNP. The congestion control algorithm used by the sender to reduce the transmission rate is outside the 
  scope of this document. </t>
  
  <t> Fast CNP is an extended CNP generated by the switch at which congestion occurs, rather than by the receiver. A RoCEv2 CNP is sent by a 
  receiver, and the Source QP of the sender is populated in the Destination QP field of the CNP, so it is easy for the sender to learn the 
  Source QP after receiving the CNP. A Fast CNP, however, is sent by a switch, and the original Destination QP of the receiver is populated 
  in the Destination QP field of the Fast CNP, which is not enough for the sender to determine the Source QP. The reason is that the sender 
  can communicate with multiple receivers, and different receivers may use the same Destination QP mapped to different Source QPs at the 
  sender. This document proposes to add to the RoCEv2 CNP an IPv6 extension header containing the original Destination Address, enabling 
  the sender to determine the Source QP from the combination of the original Destination QP and the original Destination Address (i.e., the 
  receiver's address). </t>
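
  <t> The lookup described above can be sketched as follows. This is an illustrative sketch only, not part of the mechanism defined in this 
  document: the class name QpTable, its methods, and the example addresses and QP numbers are hypothetical, and a real sender would keep this 
  state in its RDMA adapter. The sketch only shows that the Destination QP alone is ambiguous, whereas the pair formed by the receiver's 
  address and the Destination QP identifies one Source QP. </t>

  <figure>
  <artwork align="left"> <![CDATA[
```python
import ipaddress


class QpTable:
    """Hypothetical sender-side table.

    Maps (receiver IPv6 address, Destination QP) to the local Source QP.
    """

    def __init__(self):
        self._map = {}

    def add(self, receiver_addr, dest_qp, source_qp):
        # The Destination QP field is 24 bits wide.
        key = (ipaddress.IPv6Address(receiver_addr), dest_qp & 0xFFFFFF)
        self._map[key] = source_qp

    def lookup(self, receiver_addr, dest_qp):
        # Returns None when no flow matches the pair.
        key = (ipaddress.IPv6Address(receiver_addr), dest_qp & 0xFFFFFF)
        return self._map.get(key)


# Two receivers use the same Destination QP (0x12), mapped to
# different Source QPs at the sender.
table = QpTable()
table.add("2001:db8::a", 0x12, 100)
table.add("2001:db8::b", 0x12, 200)
```
]]>  </artwork>
  </figure>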
  
  </section>
  
  <section title="IPv6 Destination Options for Fast CNP">

  <t> The switch sends a Fast CNP to the sender of the RoCEv2 data packet causing congestion. If the switch does not know whether the sender 
  is able to process the Fast CNP, then the switch MAY choose to mark the ECN bits of the RoCEv2 data packet at the same time as it sends the Fast CNP. 
  The marked ECN bits of the RoCEv2 data packet cause the receiver to send a RoCEv2 CNP to the sender. In this case, the sender receives 
  both the Fast CNP and the receiver-originated CNP. If the switch knows that the sender is able to process the Fast CNP, then the switch MUST NOT 
  mark the ECN bits of the RoCEv2 data packet at the same time as it sends the Fast CNP. How the switch learns whether the sender is capable of processing 
  Fast CNP is outside the scope of this document.</t>
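
  <t> The marking rule above can be summarized by the following sketch. It is illustrative only: the names SenderCapability, on_congestion, 
  and mark_when_unknown are hypothetical, and the way a switch learns the sender's capability remains outside the scope of this document. </t>

  <figure>
  <artwork align="left"> <![CDATA[
```python
from enum import Enum


class SenderCapability(Enum):
    # Hypothetical states of the switch's knowledge about the sender.
    UNKNOWN = 0
    SUPPORTS_FAST_CNP = 1


def on_congestion(capability, mark_when_unknown=True):
    """Return (send_fast_cnp, mark_ecn) for a congested data packet."""
    if capability is SenderCapability.SUPPORTS_FAST_CNP:
        # The switch MUST NOT mark ECN when the sender is known to
        # process Fast CNP.
        return (True, False)
    # Capability unknown: the switch MAY also mark ECN, so that the
    # receiver-originated CNP still reaches the sender.
    return (True, bool(mark_when_unknown))
```
]]>  </artwork>
  </figure>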
  
  <t> The Source IPv6 address of the Fast CNP is set to the IPv6 loopback address of the switch that sends the Fast CNP, and the Destination IPv6 address 
  of the Fast CNP is copied from the Source IPv6 address of the RoCEv2 data packet causing congestion. After the sender receives a Fast CNP, the 
  sender can optionally use the Source IPv6 address to check whether it is a switch-originated Fast CNP as defined in this document. To this end, it compares 
  the Source IPv6 address of the received Fast CNP with the intended destination address of the original packet, which is carried in the IPv6 extension 
  header of the Fast CNP. If the two addresses are the same, the Fast CNP is receiver-originated; if they differ, the Fast CNP was sent by a network 
  element along the path. Whether the receiver needs to send Fast CNP is outside the scope of this document, although this document does not exclude it. 
  Furthermore, if the sender knows how to detour the congested switch (e.g., by changing the ECMP field(s) of the flow of RoCEv2 data packets that were 
  subject to forward congestion), then the sender can also use the Source IPv6 address of the Fast CNP as a signal to detour the congested switch. </t>
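
  <t> The optional address comparison above can be sketched as follows. The function name fast_cnp_origin and its return values are hypothetical 
  and chosen only for illustration. Addresses are parsed before comparison so that different textual forms of the same address compare as equal. </t>

  <figure>
  <artwork align="left"> <![CDATA[
```python
import ipaddress


def fast_cnp_origin(cnp_source, original_destination):
    """Classify the originator of a received Fast CNP.

    cnp_source: Source IPv6 address of the Fast CNP.
    original_destination: address carried in the destination option,
        i.e., the destination of the congested data packet.
    """
    src = ipaddress.IPv6Address(cnp_source)
    orig_dst = ipaddress.IPv6Address(original_destination)
    # Same address: the receiver itself sent the Fast CNP.
    # Different address: a network element on the path sent it.
    return "receiver" if src == orig_dst else "network-element"
```
]]>  </artwork>
  </figure>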
    
  <t> The Destination QP field within the InfiniBand Base Transport Header of the Fast CNP is copied from the Destination QP field within the 
  InfiniBand Base Transport Header of the RoCEv2 data packet causing congestion. </t>
  
  <t> Fast CNP adds an IPv6 extension header <xref target="RFC8200"/> to the RoCEv2 CNP, specifically, an IPv6 Destination Options header carrying one 
  IPv6 destination option. Two types of IPv6 destination option are defined for this purpose. </t>
  
  <t> When the RoCEv2 data packet causing congestion doesn't carry an IPv6 In situ OAM (IOAM) Hop-by-Hop Trace Option <xref target="RFC9486"/>, the 
  following IPv6 destination option is carried in the Fast CNP. </t>

  <figure anchor="Figure_3" title="IPv6 Destination Option Format for Carrying Destination IPv6 address of the Congested RoCEv2 Data Packet">
  <artwork align="left"> <![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |  Option Type  |  Opt Data Len |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|       Destination IPv6 address of the RoCEv2 data packet      |
|              that was subject to forward congestion           |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]>  </artwork>
  </figure>
  
  <t> Option Type: 8-bit identifier of the type of Option. The value needs to be allocated. <xref target="RFC8200"/> defines how to encode the three 
  high-order bits of the Option Type field.  The two high-order bits specify the action that must be taken if the processing IPv6 node does not 
  recognize the Option Type; for this Option, these two bits MUST be set to 10 (discard the packet and, regardless of whether or not the packet's 
  Destination Address was a multicast address, send an ICMP Parameter Problem, Code 2, message to the packet's Source Address, pointing to the 
  unrecognized Option Type).  The third-highest-order bit specifies whether the Option Data can change en route to the packet's final destination; 
  for this Option, the value of this bit MUST be set to 0 (Option Data does not change en route). </t>
  
  <t> Opt Data Len: 8-bit unsigned integer, set to 16. It is the length of the Option Data field of this Option in bytes. </t>
  
  <t> Option Data: Destination IPv6 address of the RoCEv2 data packet that was subject to forward congestion. The Option Data, combined with the 
  Destination QP within the InfiniBand Base Transport Header, is used by the sender to obtain the Work Queue Pair Number for which the transmission 
  rate is to be reduced. </t>
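
  <t> As an illustration of the option layout in Figure 3, the following sketch encodes and decodes the 18-byte option (Option Type, Opt Data Len, 
  and the 16-byte address). The Option Type value has not been allocated, so the constant PLACEHOLDER_OPTION_TYPE (0x9E) is an assumption used only 
  because its three high-order bits are 100; the function names are likewise hypothetical. Padding of the Destination Options header is not shown. </t>

  <figure>
  <artwork align="left"> <![CDATA[
```python
import ipaddress
import struct

# Assumption: placeholder value only; the real Option Type is TBD.
# 0x9E = 100 11110 in binary: act = 10, chg = 0.
PLACEHOLDER_OPTION_TYPE = 0x9E


def has_required_high_bits(option_type):
    """Check the act (10) and chg (0) bits required for this option."""
    act = (option_type >> 6) & 0x3
    chg = (option_type >> 5) & 0x1
    return act == 0b10 and chg == 0


def build_option(original_dst, option_type=PLACEHOLDER_OPTION_TYPE):
    """Encode Option Type, Opt Data Len (16), and the IPv6 address."""
    if not has_required_high_bits(option_type):
        raise ValueError("three high-order bits must be 100")
    addr = ipaddress.IPv6Address(original_dst).packed
    return struct.pack("!BB", option_type, len(addr)) + addr


def parse_option(data):
    """Decode the option; return (option_type, IPv6Address)."""
    if len(data) < 18:
        raise ValueError("option is truncated")
    option_type, opt_data_len = struct.unpack("!BB", data[:2])
    if opt_data_len != 16:
        raise ValueError("Opt Data Len must be 16")
    return option_type, ipaddress.IPv6Address(data[2:18])
```
]]>  </artwork>
  </figure>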
  
  <t> When the RoCEv2 data packet causing congestion carries an IPv6 IOAM Hop-by-Hop Trace Option, the following IPv6 destination option is carried 
  in the Fast CNP. </t>
  
  <figure anchor="Figure_4" title="IPv6 Destination Option Format for Carrying IOAM Option and Destination IPv6 address of the Congested RoCEv2 Data Packet">
  <artwork align="left"> <![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Option Type  |  Opt Data Len |   Reserved    | IOAM Opt-Type |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+
|                                                               |  |
.                                                               .  I
.                                                               .  O
.                                                               .  A
.                                                               .  M
.                                                               .  .
.                          Option Data                          .  O
.                                                               .  P
.                                                               .  T
.                                                               .  I
.                                                               .  O
.                                                               .  N
|                                                               |  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+
|                                                               |
|       Destination IPv6 address of the RoCEv2 data packet      |
|              that was subject to forward congestion           |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]>  </artwork>
  </figure>
  
  <t> Option Type: 8-bit identifier of the type of Option. The value needs to be allocated.  For this Option, the two high-order bits MUST be set to 10 
  (discard the packet and, regardless of whether or not the packet's Destination Address was a multicast address, send an ICMP Parameter Problem, 
  Code 2, message to the packet's Source Address, pointing to the unrecognized Option Type).  The third-highest-order bit MUST be set to 0 (Option 
  Data does not change en route). </t>
  
  <t> Opt Data Len: 8-bit unsigned integer. It is the length of the Option Data Field of this Option in bytes. </t>
  
  <t> Option Data: IOAM Trace Option Data and Destination IPv6 address of the RoCEv2 data packet that was subject to forward congestion. IOAM Trace 
  Option Data is copied from the IPv6 Hop-by-Hop Options header of the RoCEv2 data packet. The Destination IPv6 address of the RoCEv2 data packet, 
  combined with the Destination QP within the InfiniBand Base Transport Header, is used by the sender to obtain the Work Queue Pair Number for which 
  the transmission rate is to be reduced. The IOAM Trace Option Data is used by the sender to decide how to reduce the transmission rate, based on a 
  congestion control algorithm. One example of the IOAM Trace Option Data and the congestion control algorithm is Enhanced High Precision Congestion 
  Control (HPCC++) <xref target="I-D.miao-ccwg-hpcc"/> <xref target="I-D.miao-ccwg-hpcc-info"/>. </t>
  
  </section>
  
  <section title="Security Considerations">
  
  <t> The Fast CNP MUST be applied in a specific controlled domain. A limited administrative domain provides the network administrator with the means 
  to select, monitor, and control the access to the network, making it a trusted domain.</t>
   
  <t> To avoid potential Denial-of-Service (DoS) attacks, it is RECOMMENDED that implementations apply rate-limiting policies when generating Fast CNPs.</t>
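
  <t> One possible rate-limiting policy is a token bucket, sketched below. This document does not mandate any particular policy; the class name 
  TokenBucket, its parameters, and the example rate and burst values are assumptions made for illustration. Time is passed in explicitly so 
  that the behavior is deterministic. </t>

  <figure>
  <artwork align="left"> <![CDATA[
```python
class TokenBucket:
    """Hypothetical limiter for Fast CNP generation at a switch."""

    def __init__(self, rate_per_sec, burst):
        self.rate = float(rate_per_sec)   # tokens added per second
        self.burst = float(burst)         # maximum stored tokens
        self.tokens = float(burst)        # start with a full bucket
        self.last = 0.0                   # time of the last update

    def allow(self, now):
        """Return True if a Fast CNP may be generated at time 'now'."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Example: at most 2 Fast CNPs per second, with a burst of 2.
bucket = TokenBucket(rate_per_sec=2, burst=2)
```
]]>  </artwork>
  </figure>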

  <t> To protect against unauthorized sources sending Fast CNP to the host, implementations MUST provide a means of checking the source addresses of 
  Fast CNP against an access list before accepting the packet, for instance, using <xref target="I-D.ietf-savnet-intra-domain-architecture"/>. </t>
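
  <t> A minimal form of such a check is sketched below, assuming the access list is a set of IPv6 prefixes that cover the loopback addresses of 
  the switches in the domain. The function name is_permitted_source and the example prefixes are hypothetical, and the sketch does not represent 
  the source address validation mechanisms referenced above. </t>

  <figure>
  <artwork align="left"> <![CDATA[
```python
import ipaddress


def is_permitted_source(source, allowed_prefixes):
    """Return True if 'source' falls within any allowed IPv6 prefix."""
    addr = ipaddress.IPv6Address(source)
    return any(
        addr in ipaddress.IPv6Network(prefix)
        for prefix in allowed_prefixes
    )


# Assumed example: switch loopbacks are numbered from this prefix.
SWITCH_LOOPBACKS = ["2001:db8:1::/48"]
```
]]>  </artwork>
  </figure>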
  
  <t> A deployment MUST ensure that border filtering drops both inbound Fast CNP arriving from outside the domain and outbound Fast CNP leaving the domain.</t>
  
  <t> A deployment MUST support the configuration option to enable or disable the Fast CNP feature defined in this document. By default, the Fast CNP 
  feature MUST be disabled.</t>
  
  <t> As this document describes new options for IPv6, containing IOAM data or not, the security considerations of <xref target="RFC8200"/>, <xref target="RFC9098"/>, 
  and <xref target="RFC9486"/> apply.</t>
  
  </section>
  
  <section title="IANA Considerations"> 
  
  <t>  This document requests the following IPv6 Option Type assignments from the Destination Options and Hop-by-Hop Options sub-registry of Internet Protocol 
  Version 6 (IPv6) Parameters (https://www.iana.org/assignments/ipv6-parameters/).</t>
  
  <figure>
  <artwork><![CDATA[
Hex Value Binary Value Description                  Reference
          act chg rest
----------------------------------------------------------------
TBD1      10   0  tbd1 Fast CNP Destination Option1 [This draft]
TBD2      10   0  tbd2 Fast CNP Destination Option2 [This draft]

                            Table 1
  ]]></artwork>
  </figure>
  
  </section>

  <section title="Acknowledgements">
  <t> TBD. </t>
  </section>  
  
</middle>
  
<back>
    <references title="Normative References">
     <?rfc include="reference.RFC.2119"?>
     <?rfc include="reference.RFC.8174"?>
     <?rfc include="reference.RFC.8200"?>
     <?rfc include="reference.RFC.9486"?>
    </references>
	
    <references title="Informative References">
     <?rfc include="reference.RFC.3168"?>
     <?rfc include="reference.RFC.7514"?>
     <?rfc include="reference.RFC.9098"?>
     <?rfc include="reference.I-D.miao-ccwg-hpcc"?>
     <?rfc include="reference.I-D.miao-ccwg-hpcc-info"?>
     <?rfc include="reference.I-D.ietf-savnet-intra-domain-architecture"?>
     <reference anchor="IBTA-Spec"
                 target="https://www.infinibandta.org/ibta-specification/">
        <front>
          <title>InfiniBand Architecture Specification Volume 1, Release 1.4 </title>

          <author>
            <organization>InfiniBand Trade Association</organization>
          </author>

          <date year="2020"/>
        </front>
     </reference>
    </references>	
</back>

</rfc>
