<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc SYSTEM "rfc2629-xhtml.ent">
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<rfc
      xmlns:xi="http://www.w3.org/2001/XInclude"
      category="info"
      docName="draft-zhao-hpwan-scenarios-deployment-00"
      ipr="trust200902"
      obsoletes=""
      updates=""
      submissionType="IETF"
      xml:lang="en"
      tocInclude="true"
      tocDepth="4"
      symRefs="true"
      sortRefs="true"
      version="3">

 <!-- ***** FRONT MATTER ***** -->

 <front>

   <title abbrev="HP-WAN Scenarios and Deployment">Scenarios and Deployment Considerations for High Performance Wide Area Network</title>
    <seriesInfo name="Internet-Draft" value="draft-zhao-hpwan-scenarios-deployment-00"/>
   
 	<author fullname="Junfeng Zhao" initials="J" surname="Zhao">
      <organization>CAICT</organization>

      <address>
        <postal>
          <street></street>
          
          <city>Beijing</city>
          
          <region></region>
  
          <code></code>

          <country>China</country>
        </postal>

        <phone></phone>

        <email>zhaojunfeng@caict.ac.cn</email>
      </address>
    </author>	  
   
   <author fullname="Quan Xiong" initials="Q" surname="Xiong">
      <organization>ZTE Corporation</organization>
      <address>
        <postal>
          <street/>
         <city></city>
          <region/>
          <code/>
          <country>China</country>
        </postal>
        <phone></phone>
        <email>xiong.quan@zte.com.cn</email>
     </address>
    </author>

   <area>WIT</area>
    <workgroup></workgroup>
   <keyword></keyword>
   
   <abstract>
   
   <t>This document describes typical scenarios and deployment
   considerations for High Performance Wide Area Networks (HP-WANs).
   It also provides simulation results for data transmission in WANs
   and analyses the impacts on throughput.</t>
	  
    </abstract>
  </front>
  <middle>
   <section numbered="true" toc="default"> <name>Introduction</name>
   
   <t>As described in <xref target="I-D.xiong-hpwan-uc-req-problem" format="default"/>,
   a High Performance Wide Area Network (HP-WAN) places higher
   performance requirements on WANs. High-performance data
   transmission should provide low latency, high throughput, and
   low CPU utilization, which can significantly improve the
   performance and efficiency of intra-DC and DC interconnection
   networks. At present, tests and deployments of long-distance,
   high-performance data transmission have been carried out in
   operators' WANs, cloud service providers' DC interconnection
   networks, and research institutions' private networks. However,
   deploying high performance over long-distance, wide area
   networks still faces several challenges:</t>
   
   <ul spacing="normal">
   <li>achieving high utilization and high throughput on
   long-distance links;</li>

   <li>providing efficient congestion control mechanisms that avoid packet loss;</li>

   <li>sharing link bandwidth fairly among multiple
   concurrent applications;</li>

   <li>coping with packet ACK delay, which grows with distance and
   is challenging for high-performance applications, especially
   distributed processing models.</li>
   </ul>

   
    <t>This document describes the typical scenarios and deployment
   considerations for High Performance Wide Area Networks (HP-WANs).
   It also provides simulation results for data transmission in WANs
   and analyses the impacts on throughput.</t>
    
      <section numbered="true" toc="default"><name>Requirements Language</name>
	  
	 <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
       "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
       "OPTIONAL" in this document are to be interpreted as described in BCP
       14 <xref target="RFC2119" format="default"/>
	   <xref target="RFC8174" format="default"/> when, and only when,
	   they appear in all capitals, as shown here.</t>
	   
      </section>
    </section>
	
    <section anchor="Terminology" numbered="true" toc="default"> <name>Terminology</name>
	<t>This document uses the terminology defined in <xref target="I-D.xiong-hpwan-uc-req-problem" format="default"/>.</t>
    </section>
	
    <section numbered="true" toc="default"><name>Typical Scenarios for HP-WANs</name>
	
	<t>According to transmission distance and deployment
    requirements, high-throughput transmission covers two types
	of scenarios: high-volume data transmission over thousands of
	kilometers in WANs, and collaborative data transmission over
	hundreds of kilometers in MANs.</t>
 
    <section numbered="true" toc="default"> <name>Long-distance Data Transmission</name>
	
	<t>There are two types of scenarios: massive research data transmission
	between HPCs, and transmission of training samples between
	DCs for AI. The long-distance data transmission scenario is shown in
	Figure 1, where data flows are transmitted between two sites
	or DCs separated by 100 km to 1000 km.</t>
	
  <figure>
    <name>Long-distance Data Transmission over WANs</name>
    <artwork align="center" type="ascii-art">
  
                    +---100km~1000km---+
                    |                  |  
    +--------+      |                  |      +--------+ 
    | Host A |------+       WAN        +------| Host B |
    +--------+      |                  |      +--------+
     Site/DC        |                  |       Site/DC 
                    +------------------+
  
	   </artwork>
    </figure> 	
	
	<t>Massive research data transmission between HPCs: the scenario
    of big data migration over thousands of kilometers mainly refers
    to the high-throughput transmission of massive data between
    scientific research institutions. At present, research programs
    such as ESnet6 in the US and EuroHPC in the EU are deploying
    wide area high-performance networks to support the
    construction and operation of high-performance computing and data
    interconnection infrastructure. In this scenario, data transmission
    is usually carried out periodically or on demand, with each transfer
    ranging from a few terabytes to several hundred terabytes, and data
    transmission cost and security need to be balanced.</t>
	
	<t>Data transmission of training samples between DCs for AI:
	the construction of large-scale DCs for AI is limited by energy
	and land resources. Allocating training tasks to data centers with
	lower computing-power and electricity prices has become a cost-effective
	option. When the distance between DCs exceeds 1000 km, a wide area
	high-performance network is required to transmit high-throughput
	training samples and corpus data. Training large models on corpora
	of billions to trillions of tokens typically requires several hundred
	terabytes to petabytes of corpus data, with a large amount of data
	transmitted per session, which places high demands on transmission
	throughput and stability.</t>

	</section>
	
    <section numbered="true" toc="default"> <name>Collaborative and Interactive Data Transmission</name>
	<t>There are two types of scenarios: data transmission between
	data centers with separated storage and computing, and high-throughput
	data transmission between DCs under distributed intelligent computing.
	The collaborative and interactive data transmission scenario is
	shown in Figure 2, where data flows are transmitted between two
	or more DCs separated by 80 km to 100 km.</t>
	
	  <figure>
	    <name>Collaborative and Interactive Data Transmission over MANs</name>
	    <artwork align="center" type="ascii-art">
  
         +-------------80km~100km-----------------+
         |                                        |
    +----+----+                              +----+----+ 
    | Core DC |                              | Core DC |    
    +----+----+            MAN               +----+----+
         |                                        |
         |                                        |
    +----+----+	         +---------+         +----+----+
    | Edge DC +----------+ Edge DC +---------+ Edge DC |
    +---------+          +---------+         +---------+
  
	   </artwork>
    </figure> 	
	
	<t>Storage and computing separation scenario: cloud service
	providers deploy multiple data centers in a MAN (under 100 km),
	with storage and intelligent computing devices deployed separately.
	By extending the high-performance transmission technology used within
	a single DC across data centers, a DC cluster with separated
	storage and computing is constructed. In 2023, Amazon
	implemented storage and computing separation across data centers
	with high-throughput data transmission over a MAN at 100 Gbps
	and 100 kilometers. In addition, the training samples of customers in
	industries such as government and finance are sensitive data, and the
	consequences of data leakage are serious. Such sample data needs
	to be stored in the customer's private DC and connected to the
	cloud service provider's AI DC through a wide area high-performance
	network.</t>

    <t>Distributed coordinated inference scenario: to improve
	the user experience of computing services, an architecture with
	centralized training and distributed inference is deployed.
	Training is carried out at core computing nodes far from
	the user, while inference responds to the user at distributed
	edge nodes that are closer, with shorter latency and a better experience.
	Local sample data needs to be transmitted between the core and
	edge DCs through a high-performance MAN to fine-tune and optimize the
	trained model. In addition, user inference requests and response
	data require low-latency transmission.</t>
    </section>
	
   </section>
   
    
   
   <section numbered="true" toc="default"> <name>Deployment Considerations for HP-WANs</name>

   
   <section numbered="true" toc="default"><name>Host Optimization Deployment</name>
   
   <t>The host optimization deployment mainly adopts an improved
   transport layer protocol on the NIC of the host server to
   achieve long-distance, efficient transmission over
   lossy networks. The optimization of the transport layer
   protocol may involve caching and reassembling out-of-order
   packets, and packet loss tolerance and error correction
   mechanisms for lossy networks. The host optimization
   deployment is shown in Figure 3.</t>
   
  	  <figure>
  	    <name>Host Optimization Deployment Consideration</name>
  	    <artwork align="center" type="ascii-art">
         +--------------+      +---------------+      +--------------+ 
         |              |      |               |      |              |
    +----+----+         |      |      WAN      |      |         +----+----+ 
    | Host A  |         +------+     (lossy)   +------+         | Host B  |    
    +----+----+         |      |               |      |         +----+----+
         |DCN or        |      |               |      |DCN or        |
         |dedicated line|      |               |      |dedicated line|
         +--------------+      +---------------+      +--------------+ 
   The NIC with transport                             The NIC with transport 
   protocol optimization                              protocol optimization
   
   	   </artwork>
    </figure> 
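   <t>As an informal illustration of the caching and reassembly described
   above, the following Python sketch buffers out-of-order packets and
   releases them in sequence order. The class and method names are purely
   illustrative; real NIC implementations perform this in hardware with
   bounded buffers and timers.</t>
   <sourcecode type="python"><![CDATA[
# Simplified model of out-of-order packet caching and reassembly, as an
# optimized transport protocol on a NIC might perform over a lossy WAN.

class Reassembler:
    """Buffers packets that arrive out of order; delivers them in sequence."""

    def __init__(self):
        self.next_seq = 0   # next in-order sequence number expected
        self.buffer = {}    # seq -> payload for packets that arrived early

    def receive(self, seq, payload):
        """Accept one packet; return payloads now deliverable in order."""
        if seq < self.next_seq or seq in self.buffer:
            return []       # duplicate packet on a lossy path: ignore
        self.buffer[seq] = payload
        delivered = []
        while self.next_seq in self.buffer:
            delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return delivered

r = Reassembler()
print(r.receive(2, "c"))  # early: buffered, nothing delivered -> []
print(r.receive(0, "a"))  # in order -> ['a']
print(r.receive(1, "b"))  # fills the gap -> ['b', 'c']
]]></sourcecode>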
   </section>
   
   <section numbered="true" toc="default"><name>WAN Optimization Deployment</name>

    <t>WAN optimization improves packet loss, bandwidth utilization,
	and latency to provide high-throughput data transmission between
	DCs. The optimization of wide area networks may involve path
	selection, congestion control, and flow control. Deterministic
	forwarding may also reduce the packet loss ratio, latency, and jitter
	in wide area networks. The WAN optimization deployment is shown in
	Figure 4.</t>
	
	  	  <figure>
	  	    <name>WAN Optimization Deployment Consideration</name>
	  	    <artwork align="center" type="ascii-art">
      +--------------+      +------------------+      +--------------+ 
      |              |      |                  |      |              |
 +----+----+         |      |      WAN         |      |         +----+----+ 
 | Host A  |         +------+(High performance)+------+         | Host B  |    
 +----+----+         |      |                  |      |         +----+----+
      |DCN or        |      |                  |      |DCN or        |
      |dedicated line|      |                  |      |dedicated line|
      +--------------+      +------------------+      +--------------+ 
                             The optimization of 
                             packet loss, bandwidth 
                             utilization, and latency
                             in WAN
   	   </artwork>
    </figure> 

   </section>
    
   <section numbered="true" toc="default"><name>Gateway Deployment</name>
   
   <t>This solution requires the deployment of gateway devices at the DC
   edge to isolate or relay traffic between the data center and the wide
   area network. The gateway devices should support packet caching,
   buffering, and retransmission for high-performance services, and
   implement the collaboration and interaction between gateway and WAN
   by running optimized high-performance transport layer protocols,
   including service awareness, routing selection, and congestion control.
   In addition, the gateway needs to support mapping and conversion
   between the different high-performance protocols running in the data
   center and the WAN. The gateway deployment is shown in Figure 5.</t>
	
	  	  <figure>
	  	    <name>Gateway Deployment Consideration</name>
	  	    <artwork align="center" type="ascii-art">
                            +-------------+ 
+---------+   +---------+   |             |   +---------+   +---------+ 
| Host A  +---+ Gateway +---+   WAN       +---+ Gateway +---+ Host B  |    
+---------+   +---------+   |  (Lossy)    |   +---------+   +---------+
                            +-------------+

   	   </artwork>
    </figure> 	

   </section>
   </section>   
   <section numbered="true" toc="default"> <name>Simulation Results</name>
   
   <section numbered="true" toc="default"> <name>The Impact of Long-distance Delay</name>
   
   <t>Based on current implementations over 100 km, the
   selection of delay parameters in this experiment is mainly
   aimed at wide area scenarios of 100 to 2000 km, with round trip
   times (RTT) of 1 to 20 ms. In terms of parameter selection, the
   experiment verifies incrementally from 100 km (1 ms RTT) to
   2000 km (20 ms RTT).</t>
   
   <t>The impact of long-distance delay on throughput is shown
   in Figure 6.</t>
   
   <figure>
     <name>The Impact of Long-distance Delay on Throughput</name>
     <artwork align="center" type="ascii-art">
   
 +-------------+---------------------+---------------+---------------------+
 |RTT latency  |Message length (byte)|Distance       |Throughput           |
 +-------------+---------------------+---------------+---------------------+
 |less than 1ms|less than 1024       |less than 100km|more than 90%@100Gbps|
 +-------------+---------------------+---------------+---------------------+
 |     1ms     |   256K              |     100km     |more than 90%@100Gbps|
 +-------------+---------------------+---------------+---------------------+
 |     2ms     |   512K              |     200km     |more than 90%@100Gbps|
 +-------------+---------------------+---------------+---------------------+
 |     5ms     |   1M                |     500km     |more than 90%@100Gbps|
 +-------------+---------------------+---------------+---------------------+
 |     10ms    |   8M                |     1000km    |more than 90%@100Gbps|
 +-------------+---------------------+---------------+---------------------+
	   </artwork>
    </figure> 
   <t>The transmission performance of RDMA in different network
   environments was verified. The impact of long distance and latency
   on throughput performance is shown in Figure 6. As latency
   increases (1 to 20 ms), the RDMA message size must be continuously
   increased to sustain high-performance transmission at full
   throughput. With the maximum message length of 2 GB, a bandwidth
   of 100 Gbit/s can be achieved without loss, consistent with the
   theoretical throughput equation:</t>
               <t>Throughput = Window_Size/RTT (1)</t>
   <t>The overall analysis shows that by adjusting RDMA parameters
   (such as message length), high-performance transmission over 1000 km
   (with over 90% throughput) can be achieved. The message length
   setting is in practice related to the specific network application,
   device cache space, and cache threshold settings, and the message
   length cannot be increased without limit.</t>
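   <t>The linear relationship in equation (1) can be illustrated with a
   short calculation: to keep a link full, the data in flight (the window)
   must be at least the bandwidth-delay product. The Python sketch below
   uses this document's rough mapping of 100 km per 1 ms of RTT; the
   values are illustrative, not measurements.</t>
   <sourcecode type="python"><![CDATA[
# Equation (1): Throughput = Window_Size / RTT. Rearranged, the window
# needed to sustain a target throughput is the bandwidth-delay product.

def required_window_bytes(throughput_bps, rtt_s):
    """Window_Size = Throughput * RTT, converted from bits to bytes."""
    return throughput_bps * rtt_s / 8

for rtt_ms in (1, 2, 5, 10, 20):
    w = required_window_bytes(100e9, rtt_ms / 1000)   # 100 Gbps link
    print(f"RTT {rtt_ms:>2} ms (~{100 * rtt_ms} km): "
          f"at least {w / 1e6:.1f} MB in flight")
# 1 ms needs 12.5 MB in flight; 20 ms (~2000 km) needs 250.0 MB.
]]></sourcecode>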
   </section>
   
   <section numbered="true" toc="default"> <name>The Impact of Packet Loss</name>
   <t>Traditional RDMA adopts the Go-Back-N retransmission mechanism,
   which retransmits all data packets starting from the lost packet.
   Packet loss can therefore cause significant performance degradation in
   RDMA. TCP, by contrast, only needs to retransmit the lost packets
   themselves, and the latest RDMA network cards have started to use
   selective repeat. Therefore, the classic throughput model relating
   the TCP packet loss rate (p), message size (MSS), latency (RTT), and
   an empirical constant (C) can be used as a reference:</t>
             <t>Throughput = Min{link capacity, (MSS/RTT) * C/sqrt(p)} (2)</t>
   <t>The actual performance of RDMA in testing differs from that of TCP,
   but the main impact in wide area networks is latency, and the
   retransmission and congestion control algorithm models are similar.
   Therefore, the theoretical rate of RDMA can be estimated empirically by
   adjusting the value of the parameter C in equation (2)
   (TCP empirical value: C = 1.0).</t>
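   <t>A minimal sketch of this model: the code below uses the widely cited
   Mathis form, Throughput = (MSS/RTT) * C/sqrt(p), capped at the link
   capacity as in the Min{} expression. All parameter values are
   illustrative, not measured results.</t>
   <sourcecode type="python"><![CDATA[
import math

# Loss-based throughput model (Mathis form), capped at link capacity.
# C is the empirical constant (C = 1.0 for TCP per the text).

def model_throughput_bps(mss_bytes, rtt_s, loss_rate, c=1.0, link_bps=100e9):
    """min(link capacity, (MSS/RTT) * C / sqrt(p)), in bit/s."""
    rate = (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate)
    return min(link_bps, rate)

# 512 KB messages over a 10 ms RTT (~1000 km):
print(model_throughput_bps(512 * 1024, 0.010, 1e-4) / 1e9)  # ~41.9 Gbps
print(model_throughput_bps(512 * 1024, 0.010, 1e-7) / 1e9)  # link-limited: 100.0
]]></sourcecode>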

   <t>When both large delay and packet loss coexist, sustaining over 80%
   throughput on a 100G link requires the packet loss rate in the data
   center to be below 0.005%. In the wide area DC interconnection
   scenario, the retransmission cost and response time increase with
   propagation delay, so the packet loss threshold is even stricter than
   in the data center, requiring the network to be as close to lossless
   as possible. In a wide area scenario, even with a selective
   retransmission optimization and a packet loss rate below 0.001%, it is
   difficult to achieve a bandwidth utilization above 70%.</t>

   <t>In general, the network performance indicators for RDMA over a wide
   area of 1000 kilometers are as follows: the throughput of wide area
   RDMA is directly proportional to the message size, and inversely
   related to the network packet loss rate and latency. To ensure 80%
   throughput on a 100 Gbps link over 1000 kilometers, the message length
   needs to be greater than 512 KB, and the increased latency makes the
   packet loss rate requirement extremely strict.</t>
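   <t>As a model-based sanity check of the figures above (not a
   measurement), the Mathis form of equation (2) can be rearranged to give
   the largest loss rate at which a target throughput is still reachable:
   p = (C * MSS / (RTT * Throughput))^2. With 512 KB messages, a 10 ms RTT
   (about 1000 km), and a target of 80% of 100 Gbps, this yields a loss
   rate on the order of thousandths of a percent, consistent with the
   strict packet loss requirement stated above.</t>
   <sourcecode type="python"><![CDATA[
# Tolerable loss rate from the Mathis-form model, rearranged for p.
# Parameters follow the text: 512 KB messages, 10 ms RTT, 80% of 100 Gbps.

def max_loss_rate(mss_bytes, rtt_s, target_bps, c=1.0):
    """Largest loss rate p at which the model still meets target_bps."""
    return (c * mss_bytes * 8 / (rtt_s * target_bps)) ** 2

p = max_loss_rate(512 * 1024, 0.010, 0.8 * 100e9)
print(f"tolerable loss rate ~ {p:.4%}")  # ~0.0027%
]]></sourcecode>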

   </section>
   
   </section>

   <section  numbered="true" toc="default"> <name>Security Considerations</name>
   <t>TBA</t>
   </section>
   <section numbered="true" toc="default"> <name>IANA Considerations</name>
   <t>This document makes no requests for IANA action.</t>
   </section>
	
   <section numbered="true" toc="default"> <name>Acknowledgements</name>
   <t>TBA</t>
   </section> 
   
  </middle>
  
  <!--  *****BACK MATTER ***** -->

 <back>
 
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <xi:include href="https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
        <xi:include href="https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.8664.xml"/>
        <xi:include href="https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.9232.xml"/>
        <xi:include href="https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.7424.xml"/>	
        <xi:include href="https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.3168.xml"/>
		<xi:include href="https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.9438.xml"/>
        <xi:include href="https://datatracker.ietf.org/doc/bibxml3/draft-xiong-hpwan-uc-req-problem.xml"/>		
		
      </references>
    </references>
 
 </back>
</rfc>
