<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
     which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?xml-model href="rfc7991bis.rnc"?>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
     please see http://xml.resource.org/authoring/README.html. -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="3"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
     (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="info" docName="draft-song-inc-transport-protocol-req-01" ipr="trust200902" consensus="true" xmlns:xi="http://www.w3.org/2001/XInclude">  

<front>
    <title abbrev="TP for INC">The Requirements of a Unified Transport Protocol for In-Network Computing in Support of RPC-based Applications</title>
    
    <author fullname="Haoyu Song" initials="H." surname="Song">
      <organization>Futurewei Technologies</organization>
      <address>
        <postal>
          <street></street>
          <city>Santa Clara</city>
          <region>CA</region>
          <code></code>
          <country>US</country>
        </postal>
        <email>haoyu.song@futurewei.com</email>
      </address>
    </author>
    
    <author fullname="Wenfei Wu" initials="W." surname="Wu">
      <organization>Peking University</organization>
      <address>
        <postal>
          <street></street>
          <city>Beijing</city>
          <region></region>
          <code></code>
          <country>CN</country>
        </postal>
        <email>wenfeiwu@pku.edu.cn</email>
      </address>
    </author>

    <author fullname="Dirk Kutscher" initials="D." surname="Kutscher">
      <organization>The Hong Kong University of Science and Technology (Guangzhou)</organization>
      <address>
        <postal>
          <street></street>
          <city>Guangzhou</city>
          <region></region>
          <code></code>
          <country>CN</country>
        </postal>
        <email>ietf@dkutscher.net</email>
      </address>
    </author>

  <area/>
  <workgroup></workgroup>


  <abstract>
    <t>
      In-network computing (INC) breaks the end-to-end principle and
      introduces new challenges to the transport layer
      functionalities. This draft provides the background of a suite
      of RPC-based applications which can take advantage of INC
      support, surveys the existing transport protocols to show they
      are insufficient or unsuitable for use in this context, and
      lays out the requirements for developing a general transport
      protocol tailored for such applications. The purpose of this
      draft is to help understand the problem domain and inspire the
      design and development of a unified INC transport protocol.
    </t>
  </abstract>
  
  <note title="Requirements Language">
    <t>
      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
      NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
      "OPTIONAL" in this document are to be interpreted as described
      in <xref target="RFC2119">RFC 2119</xref>.
    </t>
  </note>
</front>

  <middle>
    <section anchor="motivation" numbered="true" toc="default">
      <name>Motivation</name>

      <t>
	In a broader sense, COmputing-In-Network (COIN) covers many
	distinct types of applications which rely on networks to do
	more than packet forwarding (e.g., active networking, edge
	computing, and service function chaining). However, the
	emerging term In-Network Computing (INC) <xref target="inc"/>
	refers to a narrower scope, in which on-path programmable
	networking devices (e.g., switches and routers between clients
	and servers) act as accelerators or function offloaders to
	boost throughput, reduce server load, or improve latency,
	typically in a well-controlled data center network
	environment.
      </t>
      <t>
	Some INC implementations evolved from programmable data plane
	systems and align with the trend of network programmability at
	large. In recent years, INC has been shown to support many
	promising applications (e.g., caching, aggregation, and
	agreement). For example, in distributed machine learning
	(DML), training nodes produce data (gradients) that needs to
	be aggregated or reduced, and the result may then be
	distributed to one or multiple consumers. As another example,
	the NetClone system <xref target="netclone"/> uses an
	in-network forwarder to replicate RPC invocation messages and
	to make more informed forwarding decisions based on observed
	latencies, thereby accelerating RPC communication.
      </t>
      <t>
	While it is possible to achieve this kind of operation purely
	with end-to-end communication between worker nodes,
	performance can be dramatically improved by offloading both
	the operation processing and the data dissemination to nodes
	in the network. These in-network processors are often
	conceived as semi-transparent, performance-enhancing on-path
	elements, i.e., they are not the actual endpoints in transport
	protocol sessions, yet they intercept packets carrying
	application data and may generate new data that they then
	have to transmit.
      </t>

      <t>
	The intended INC behavior thus cannot be achieved with
	existing end-to-end transport protocols such as TCP and QUIC.
	Conventionally, network devices are only supposed to process
	packets up to the network layer and leave the upper layers
	(i.e., the transport layer and the application layer) intact
	for the end hosts to process. INC, however, requires the
	network devices to participate in the application logic, so
	they inevitably need to process the related packets up to the
	application layer, as shown in <xref target="stack" />.
      </t>
      
      
      <figure anchor="stack">
        <name>Network Protocol Stack in INC</name>
	
        <artwork align="center" name="" type="" alt=""><![CDATA[
		
                  /-------------------\ 
                 /     INC devices     \
+-----------+   /     +-----------+     \    +-----------+  
|application|   |     |application|     |    |application|
+-----------+   |     +-----------+     |    +-----------+
| transport |   |     | transport |     |    | transport |
+-----------+   |     +-----------+     |    +-----------+
|  network  |<--+---->|  network  |<----+--->|  network  |
+-----------+   \     +-----------+     /    +-----------+
   client        \---------------------/         server
                         network 
         ]]></artwork>
      </figure>

<!--
      <t>
	Although such an architectural deviation does introduce some
	complexity to the network system, given the significant
	benefits presented by the applications, it is worthwhile to
	make the effort, as long as we can limit the use to just the
	beneficial applications and confine the scope in a confined
	network domain (e.g., a data center network).
      </t>
-->	  
      <t>
	In the context of the INC systems we refer to here, the
	computing functions need to be performed in the data plane
	fast path. There may be other use cases where a network device
	needs to direct the application packets to the slow path
	(e.g., a local CPU or a remote server) for processing; we do
	not consider those cases here.
      </t>
      <t>
	Programmable data plane devices use different programming
	languages (e.g., P4 and HDL) and have different chip
	architectures (e.g., RMT pipeline, RTC, and FPGA).  These
	devices are optimized for simple packet processing and
	forwarding with limited hardware resources. Specifically, it
	is difficult for these devices to support complex stateful
	operations and mathematical calculations beyond integer
	addition and shift. Unsurprisingly, the in-network computing
	functions for the supported applications are all relatively
	simple (e.g., resorting to lookup tables or counters).
	However, programmable switch chip technology is also
	progressing fast, with better stateful operation support and
	computing capabilities.  It is conceivable that future
	programmable switches could undertake more computing tasks,
	albeit still in a facilitating role.
      </t>

      <t>
	To correctly handle the computing tasks, however, a reliable
	transport layer must be present. The transport layer provides
	common services such as connection maintenance, reliability,
	flow control, and multiplexing. The existing INC applications
	either make oversimplified assumptions to eschew this problem
	(e.g., assume the use of UDP as the transport layer protocol
	or ignore the transport layer altogether) or provide an ad hoc
	solution dedicated to a particular application, which
	entangles the transport and application functions (e.g.,
	ATP).  A general protocol for the transport layer is needed
	for INC to take care of the common transport issues. It can
	free the application developers from worrying about the
	transport issues and help them focus on the application logic
	itself.
      </t>

      <t>
	This draft provides the background of a suite of RPC-based
	applications which can take advantage of INC support, surveys
	the existing transport protocols to show they are insufficient
	or unsuitable for use in this context, and lays out the
	requirements for developing a general transport protocol
	tailored for such applications. The purpose of this draft is
	to help understand the problem domain and inspire the design
	and development of a unified INC transport protocol.
      </t>
    </section>


<section anchor="classification" numbered="true" toc="default">
      <name>INC Application RPCs</name>

      <t>
	The INC applications considered in this draft all follow the
	communication paradigm of idempotent Remote Procedure Call
	(RPC): A client sends a message with arguments to a server and
	gets back a response which reflects the computation result
	based on the arguments. On the one hand, this paradigm differs
	from TCP, which is mainly used for transferring byte streams;
	on the other hand, it requires a reliable datagram service,
	which is more than UDP can provide.
      </t>	  

      <t>
	We can classify these INC applications into three service
	models:
      </t>

      <t>
	<list style="hanging">  

	  <t hangText="Synchronous Collaboration (SC):"> each client in
	  a set sends a piece of data to a server at roughly the same
	  time. The result can be computed and sent back to the
	  clients once all the data pieces are received. A notable
	  example is AllReduce (one operation in the class of
	  Collective Communication <xref
	  target="I-D.yao-tsvwg-cco-problem-statement-and-usecases"/>). Quite
	  often there is one result that needs to be transmitted back
	  to all clients, i.e., a multi-destination delivery service
	  could be applied.
	  </t>

	  <t hangText="Asynchronous Collaboration (AC):"> each client
	  in a set sends multiple data items to a server. The result
	  can be computed once all the data items are received. An
	  example of such an application is MapReduce.</t>


	  <t hangText="Individual Request (IR):"> a client sends
	  individual requests to a server and gets a response for each
	  request. An example of such an application is NetCache <xref
	  target="netcache"/>.</t>

      </list></t>
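
      <t>
	As a non-normative illustration of the Synchronous
	Collaboration model, the following Python sketch shows how an
	aggregation point might sum integer vectors from a fixed set
	of clients and release one result once every piece has
	arrived. The class name AggregationSlot, its fields, and the
	rule of ignoring a repeated piece from the same client are
	assumptions made for this example only; they are not taken
	from any existing INC system.
      </t>

      <figure>
        <name>Sketch of Synchronous Collaboration Aggregation (Illustrative)</name>
        <artwork align="left" name="" type="" alt=""><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class AggregationSlot:
    """Per-job state a device might keep for one SC job (assumed)."""

    expected_clients: int
    total: Optional[List[int]] = None
    seen: Set[str] = field(default_factory=set)

    def on_packet(
        self, client_id: str, values: List[int]
    ) -> Optional[List[int]]:
        """Add one contribution; return the sum once all have sent."""
        if client_id in self.seen:
            # Assumed duplicate handling: a repeated piece is ignored.
            return None
        if self.total is None:
            self.total = [0] * len(values)
        if len(values) != len(self.total):
            raise ValueError("contributions must have equal length")
        self.seen.add(client_id)
        # Integer addition only, in line with simple data plane math.
        self.total = [a + b for a, b in zip(self.total, values)]
        if len(self.seen) == self.expected_clients:
            # Complete: this one result would go back to all clients.
            return self.total
        return None
```
         ]]></artwork>
      </figure>

      <t>
	In this sketch, a slot created with expected_clients set to 3
	returns nothing for the first two distinct clients and
	returns the element-wise sum when the third one arrives.
      </t>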


      <t>
	From a different perspective, we can observe that there are
	three basic communication modes depending on the applications,
	as shown in <xref target="inc-comm-mode" />. From the client's
	perspective, the INC support is transparent, i.e., the client
	sends a message, such as an RPC, and if there is an on-path
	INC device, that device can execute the operation as an
	optimization. If there is no such on-path INC device, the
	message is transmitted to a specified endpoint. Depending on
	the actual network configuration, capabilities, and load
	situation, one of the following modes can be selected:
      </t>
      <t>
	<list style="hanging"> 

	  <t hangText="Device Only Mode (DO):"> the INC network
	  devices alone can completely finish a computing task.
	  Therefore a client can choose to send a task to the INC
	  network devices instead of a server and the final result is
	  directly returned to the client from the INC network
	  devices. </t>

	  <t hangText="Device+Server Mode (DS):"> the INC network
	  devices can only partially finish a computing task and the
	  intermediate result still needs to be sent to a server to
	  finalize. The final result must be returned to the client
	  from a server. </t>

	  <t hangText="Hybrid Mode (HM):"> the INC network devices may
	  or may not finish a computing task, therefore the final
	  result may be returned by the INC network devices or by a
	  server. </t>

	</list>
      </t>

      <t>
	Each mode has its dominant benefits: Using DO mainly aims to
	reduce the latency and using DS mainly aims to reduce the
	traffic bandwidth and server load. Using HM may achieve both
	benefits, albeit with more implementation complexity.
      </t>



      <figure anchor="inc-comm-mode">
        <name>In-Network Computing Working Modes</name>
        <artwork align="center" name="" type="" alt=""><![CDATA[
	        
                   +-------+            
+------+         +-------+ |        +------+  
|      |         |network| |        |      |
|client|<------->|devices| |        |server|
|      |         |       |-+        |      |
+--^---+         +-------+          +---^--+
   |                                    |
   +------------------------------------+                  
               Device Only Mode (DO)
			   
                   +-------+
+------+         +-------+ |        +------+  
|      |         |network| |        |      |
|client+-------->|devices+-+------->|server|
|      |         |       |-+        |      |
+--^---+         +-------+          +--+---+
   |                                   |
   +-----------------------------------+         
              Device+Server Mode (DS)
         
                   +-------+
+------+         +-------+ |        +------+  
|      |         |network| |        |      |
|client+-------->|devices+.........>|server|
|      |<--------|       |-+        |      |
+--^---+         +-------+          +--.---+
   :                                   :
   .....................................        
              Hybrid Mode (HM)
			  
         ]]></artwork>
      </figure>
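
      <t>
	The Hybrid Mode can be sketched for an Individual Request
	style key-value lookup as follows. This is a minimal,
	non-normative Python example: the function hybrid_lookup, the
	Responder enumeration, and the use of a dictionary as the
	device cache are hypothetical names chosen for illustration
	and do not describe the behavior of any particular system.
      </t>

      <figure>
        <name>Sketch of a Hybrid Mode Lookup (Illustrative)</name>
        <artwork align="left" name="" type="" alt=""><![CDATA[
```python
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class Responder(Enum):
    DEVICE = "device"
    SERVER = "server"


def hybrid_lookup(
    key: str,
    device_cache: Dict[str, str],
    server_lookup: Callable[[str], str],
) -> Tuple[Responder, str]:
    """Hybrid Mode: the device answers if it can, else the server."""
    cached: Optional[str] = device_cache.get(key)
    if cached is not None:
        # The device finishes the task and replies directly.
        return Responder.DEVICE, cached
    # The device cannot finish; the request continues to the server.
    return Responder.SERVER, server_lookup(key)
```
         ]]></artwork>
      </figure>

      <t>
	With an empty device cache the sketch degenerates to plain
	client-server operation, which reflects the view that the INC
	support is an optimization rather than a prerequisite.
      </t>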


      <t>
	<xref target="model"/> provides the dominant combinations of
	the service model and communication model. Since AC may
	require more resources than a network device can offer, it is
	less used with the DO mode. IR usually aims to optimize the
	response latency, so the DS mode is less helpful, yet HM may
	provide a fallback mechanism for requests that the devices
	cannot satisfy.
      </t>


      <figure anchor="model">
        <name>Service Model and Communication Model </name>
	
        <artwork align="center" name="" type="" alt=""><![CDATA[
+-----------------------+-----+-----+-----+
|                       | DO  | DS  | HM  |
+-----------------------+-----+-----+-----+   		
|Sync Collaboration(SC) |  x  |  x  |  x  |
+-----------------------+-----+-----+-----+
|Async Collaboration(AC)|     |  x  |     |
+-----------------------+-----+-----+-----+
|Individual Request(IR) |  x  |     |  x  |
+-----------------------+-----+-----+-----+
         ]]></artwork>
      </figure>
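
      <t>
	For readers who prefer a machine-readable form, the table in
	<xref target="model"/> can be transcribed as a simple lookup.
	The Python names DOMINANT_COMBINATIONS and is_dominant are
	introduced here for illustration only; the content is a
	direct transcription of the marked cells and implies nothing
	beyond them.
      </t>

      <figure>
        <name>Dominant Combinations as a Lookup (Illustrative)</name>
        <artwork align="left" name="" type="" alt=""><![CDATA[
```python
# Marked cells of the service model / communication model table.
DOMINANT_COMBINATIONS = {
    "SC": {"DO", "DS", "HM"},
    "AC": {"DS"},
    "IR": {"DO", "HM"},
}


def is_dominant(service_model: str, mode: str) -> bool:
    """Return True if the table marks this combination."""
    return mode in DOMINANT_COMBINATIONS.get(service_model, set())
```
         ]]></artwork>
      </figure>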


</section>



<section anchor="survey" numbered="true" toc="default">
	<name>Existing Transport Protocols</name>

	<t>
	  We argue that the existing transport protocols are not
	  suitable for INC.
	</t>

	<t><list style="hanging">
	  <t hangText="TCP:"> As the most widely used transport
	  protocol, TCP (as well as its variants such as DCTCP and
	  MPTCP) is ruled out because of its end-to-end streaming
	  semantics. Any mutation of the TCP packet payloads is
	  considered a break in the stream, but the INC applications
	  which require network device collaboration do need to modify
	  the packet payload. Also, any packet that the receiver
	  detects as missing from a TCP stream must be retransmitted;
	  this rules out INC applications in which a device terminates
	  a packet and returns the computing result directly. While it
	  is theoretically possible to make the network device
	  maintain two separate TCP connections with the two
	  communicating end hosts, the cost of implementation is
	  prohibitively large. Due to its handshake overhead and its
	  longer startup times, TCP is also not a good protocol for
	  high-performance RPC communication <xref target="davie"/>.
	  More issues with TCP in data centers can be found in <xref
	  target="homa" />.
	  </t>

	  <t hangText="UDP:"> As another common transport protocol,
	  UDP is unreliable and lacks mechanisms for flow control.
	  Some previous INC applications assume the use of UDP as the
	  transport layer for simplicity, but this provisional measure
	  can neither meet production-level requirements nor provide
	  enough transport layer support for all the concerned INC
	  applications. While the missing features could be
	  implemented on top of UDP, this would shift complexity to
	  applications and INC implementations.
	  </t>

	  <t hangText="QUIC:"> In general, QUIC provides a better
	  platform for efficient RPC communication compared to TCP
	  <xref target="davie"/>.  However, it is designed for wide
	  area networks, and a part of the packet header and the
	  payload are encrypted, which prevents network devices from
	  processing packets at the application layer and,
	  potentially, from adding metadata.
<!--
	  Even without the encryption,
	  the QUIC header information is not enough to support the INC
	  applications.
-->

	  </t>

	  <t hangText="MTP:"> <xref target="mtp">MTP</xref> is the
	  first transport protocol dedicated to INC.  It captures some
	  core requirements for INC and is open to different
	  congestion control algorithms. However, it is inspired by
	  pathlet routing and mainly focuses on pathlet-based
	  congestion control support. It lacks efficient support for
	  all the application types described above.
	  </t>
	  
	  <t hangText="RDMA:"> RDMA allows two end hosts to exchange
	  data quickly. Whether natively supported (i.e., InfiniBand)
	  or carried over UDP or TCP, it requires in-order and
	  immutable transport, which poses challenges similar to those
	  of TCP for INC applications.</t>

	  <t hangText="HOMA:"> <xref target="homa">HOMA</xref> is
	  proposed as a data center transport protocol to replace
	  TCP. However, HOMA is not designed with INC in mind
	  either.</t>

	  <t hangText="Information-Centric Networking"> (ICN) provides
	  receiver-driven, data-oriented communication services and
	  has features such as address-less operation due to the
	  named-data access principle. It also provides intrinsic
	  multi-destination delivery and has been demonstrated in
	  remote method invocation and distributed computing scenarios
	  <xref target="icndiscomp"/>, albeit not yet in the particular
	  INC scenarios presented here.</t>

	  <t hangText="Ad Hoc Protocols:"> Several INC applications
	  (e.g., ATP and ASK) provide a customized transport
	  layer. However, these protocols only work for a particular
	  application. Moreover, they lack a clear separation between
	  the transport layer and the application layer. Some
	  application layer functions leak into the transport layer,
	  further limiting their generality. </t>
	  
	</list></t>

</section>


<section anchor="requirement" numbered="true" toc="default">
  <name>Requirements</name>
	  
	  
  <t>
    The premise of the E2E principle is that it is more costly to
    guarantee a given level of reliability by relying on the network
    than by relying on the end hosts. INC introduces multiple end
    points in the communication, with one of them residing in the
    network, effectively changing the communication paradigm from E2E
    to E2I2E (where I stands for the intermediate nodes which conduct
    the transport layer functionalities). Therefore, we need to
    revisit the E2E principle to see if we can break it or adapt to
    it in the new context.  We can observe several properties for the
    covered INC applications.
  </t>

  <t>
    <list style="symbol">
      
      <t>
	In principle, INC protocols should run over existing networks,
	and not make any assumptions on the type of environment they
	are used in, such as data center or access network. However,
	for performance reasons, some optimizations may be needed that
	would limit the deployment to such specific domains.
      </t>

      <t>
	When deployed in a data center for use cases such as DML, an
	INC system needs to provide High-Performance-Computing (HPC)
	levels of performance. In such communication scenarios, exact
	timing and scheduling may be required.
      </t>
      


      <!--
	        <t>
	These applications are usually applied in a limited network
	scale with topology regularity (e.g., a data center network or
	an access network).</t> <t> The network is under the control
	of a single administrative entity.</t> <t>The benefit-cost
	ratio is significant enough to warrant the INC deployment,
	which usually requires a software- hardware and host-device
	co-design.</t>
-->

	<t>
	  Multiple applications with the same or different service
	  models, or multiple jobs for the same application, can be
	  active at the same time.
	</t>

	<t>
	  INC should be seen as an optional performance enhancement
	  that can be added to a network if needed, but the overall
	  system should still work without such INC systems in the
	  network.
	</t>

	
    </list>
  </t>

  <t>
    Based on these observations, a new transport layer protocol for
    INC in support of RPC-based applications can be designed. The
    protocol only works in a limited domain and it virtualizes the
    network as a single logical middle point. That is, if multiple
    network devices collaborate on a computing task, they are
    considered as one device. Packet forwarding among these devices
    needs to be handled by the network layer using techniques such as
    Segment Routing (SR) and Service Function Chaining (SFC),
    depending on the overall system design.
  </t>
	  
      
  <t>
    From the previous discussion, we lay out the design requirements
    of a transport protocol dedicated to INC:
  </t>

    <t><list style="hanging">
      <t hangText="Simplicity:"> Due to the limited resources and
      capabilities of the programmable network devices, the transport
      layer functions in them cannot be complex. For example,
      per-flow state machines and congestion control algorithms are
      difficult to implement in the programmable network
      devices. The protocol should aim to leave the complexity to the
      end hosts and require only simple processing in the programmable
      network devices.
      </t>

      <t hangText="Generality:"> The different service models and
      communication models should all be supported. The protocol
      should also be independent of the underlying network layer
      protocol. </t>

      <t hangText="Openness:"> Since the performance requirements of
      the applications may vary, the flow control and reliability
      mechanism of the protocol should be open to different
      algorithms.</t>

      <t hangText="Compatibility:"> The protocol should be able to
      coexist with the other transport protocols.</t>
    </list></t>
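
    <t>
      One way to read the Simplicity requirement is that reliability
      state lives on the end hosts. The following non-normative Python
      sketch shows a client that resends an idempotent request until
      some responder, whether an INC device or a server, answers. The
      function call_with_retries and the send callback are
      hypothetical; send stands in for one transmit step followed by a
      bounded wait, and the retry limit is an arbitrary assumption.
    </t>

    <figure>
      <name>Sketch of End-Host Retransmission (Illustrative)</name>
      <artwork align="left" name="" type="" alt=""><![CDATA[
```python
from typing import Callable, Optional


def call_with_retries(
    send: Callable[[int, bytes], Optional[bytes]],
    request_id: int,
    payload: bytes,
    max_attempts: int = 3,
) -> bytes:
    """Resend an idempotent request until some responder answers."""
    for _ in range(max_attempts):
        # `send` models one transmit-and-wait step; it returns None
        # when no response arrived before the (assumed) timeout.
        response = send(request_id, payload)
        if response is not None:
            return response
    raise TimeoutError(f"request {request_id} got no response")
```
       ]]></artwork>
    </figure>

    <t>
      Because the RPCs considered in this draft are idempotent,
      resending the same request is safe in this sketch, and the
      network devices do not need a per-flow retransmission state
      machine.
    </t>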
      
</section>

<section anchor="IANA" numbered="true" toc="default">
  <name>IANA Considerations</name> 
  <t>This document includes no request to IANA.</t>
</section>

<section anchor="Security" numbered="true" toc="default">
  <name>Security Considerations</name>  
  <t>tbd</t>
</section>

</middle>

<back>
  
  <references title="Normative References">
    <?rfc include='reference.RFC.2119'?> 	
  </references>
  
  <references title="Informative References">
    
    <reference anchor="homa" target="http://dx.doi.org/10.48550/arXiv.2210.00714">
      <front>
	<title>It's Time to Replace TCP in the Datacenter</title>
	<author initials="J." surname="Ousterhout"/>
	<date year="2023"/>
      </front>
    </reference>
    
    <reference anchor="mtp" target="http://dx.doi.org/10.1145/3484266.3487382">
      <front>
	<title>TCP is Harmful to In-Network Computing: Designing a Message Transport Protocol (MTP)</title>
	<author initials="B." surname="Stephens"/>
	<author initials="D." surname="Grassi"/>
	<author initials="H." surname="Almasi"/>
	<author initials="T." surname="Ji"/>
	<author initials="B." surname="Vamanan"/>
	<author initials="A." surname="Akella"/>
	<date year="2021"/>
      </front>
    </reference>

    <reference anchor="inc" target="https://dx.doi.org/10.1109/ISCA45697.2020.00085">
      <front>
	<title>An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives</title>
	<author initials="B." surname="Klenk et al."/>
	<date year="2020"/>
      </front>
      <refcontent>ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)</refcontent>
    </reference>

    <reference anchor="netclone" target="https://dl.acm.org/doi/10.1145/3603269.3604820">
      <front>
	<title>NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs</title>
	<author initials="G." surname="Kim"/>
	<date year="2023"/>
      </front>
      <refcontent>In Proceedings of the ACM SIGCOMM 2023 Conference
      (ACM SIGCOMM '23). Association for Computing Machinery, New
      York, NY, USA, 195-207</refcontent>
    </reference>

    <?rfc include="reference.I-D.yao-tsvwg-cco-problem-statement-and-usecases.xml"?>

    <reference anchor="netcache">
      <front>
	<title>NetCache: Balancing Key-Value Stores with Fast In-Network Caching</title>
	<author fullname="Xin Jin"/>
	<author fullname="Xiaozhou Li"/>
	<author fullname="Haoyu Zhang"/>
	<author fullname="Robert Soule"/>
	<author fullname="Jeongkeun Lee"/>
	<author fullname="Nate Foster"/>
	<author fullname="Changhoon Kim"/>
	<author fullname="Ion Stoica"/>
	<date year="2017"/>
      </front>
      <refcontent>In Proceedings of the 26th Symposium on Operating
      Systems Principles (SOSP '17). Association for Computing
      Machinery, New York, NY, USA,
      121-136. https://doi.org/10.1145/3132747.3132764</refcontent>
    </reference>

    <reference anchor="davie">
      <front>
	<title>QUIC is not a TCP Replacement</title>
	<author fullname="Bruce Davie"/>
	<date year="2022" month="September" day="26"/>
      </front>
      <refcontent>https://systemsapproach.substack.com/p/quic-is-not-a-tcp-replacement</refcontent>
    </reference>

    <reference anchor="icndiscomp">
    <front>
      <title>SoK: Distributed Computing in ICN</title>
      <author fullname="Wei Geng"/>
      <author fullname="Yulong Zhang"/>
      <author fullname="Dirk Kutscher"/>
      <author fullname="Abhishek Kumar"/>
      <author fullname="Sasu Tarkoma"/>
      <author fullname="Pan Hui"/>
      <date year="2023"/>
    </front>
    <refcontent>In Proceedings of the 10th ACM Conference on Information-Centric Networking (ACM ICN '23). Association for Computing Machinery, New York, NY, USA, 88-100. https://doi.org/10.1145/3623565.3623712</refcontent>
  </reference>
    </references>    
  

  
</back>

</rfc>
