<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
     which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
     please see http://xml.resource.org/authoring/README.html. -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="3"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
     (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="info" docName="draft-song-inc-transport-protocol-req-00" ipr="trust200902" consensus="true"> 

<front>
    <title abbrev="TP for INC">The Requirements of a Unified Transport Protocol for In-Network Computing in Support of RPC-based Applications</title>
    
    <author fullname="Haoyu Song" initials="H." surname="Song">
      <organization>Futurewei Technologies</organization>
      <address>
        <postal>
          <street></street>
          <city>Santa Clara</city>
          <region>CA</region>
          <code></code>
          <country>US</country>
        </postal>
        <email>haoyu.song@futurewei.com</email>
      </address>
    </author>
    
	<author fullname="Wenfei Wu" initials="W." surname="Wu">
      <organization>Peking University</organization>
      <address>
        <postal>
          <street></street>
          <city>Beijing</city>
          <region></region>
          <code></code>
          <country>CN</country>
        </postal>
        <email>wenfeiwu@pku.edu.cn</email>
      </address>
    </author>

  <area/>
  <workgroup></workgroup>


  <abstract>
    <t>In-network computing (INC) breaks the end-to-end principle and introduces new challenges to transport layer functionality. This draft provides background on a suite of RPC-based applications that can take advantage of INC support, surveys the existing transport protocols to show that they are insufficient or ill-suited for this context, and lays out the requirements for a general transport protocol tailored to such applications. The purpose of this draft is to help readers understand the problem domain and to inspire the design and development of a unified INC transport protocol.</t>
  </abstract>
  
  <note title="Requirements Language">
           <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
           "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
           document are to be interpreted as described in <xref target="RFC2119">RFC 2119</xref>.</t>
  </note>
</front>

  <middle>
    <section anchor="motivation" numbered="true" toc="default">
      <name>Motivation</name>
      <t>In a broader sense, COmputing-In-Network (COIN) covers
many distinct types of applications that rely on networks
to do more than packet forwarding (e.g., active networking,
edge computing, and service function chaining). However, the emerging term In-Network Computing (INC)
refers to a narrower scope: it uses on-path
programmable networking devices (e.g., switches and routers
between clients and servers) as accelerators or function
offloaders to boost throughput, reduce server load, or improve
latency, typically in a well-controlled data center network environment.
INC is a natural outgrowth of the progress in programmable
data planes and of the trend toward network programmability
at large. In recent years, it has been shown to support many promising applications (e.g., caching,
aggregation, and agreement).</t>

<t>An unfortunate consequence of INC is that it breaks the
end-to-end principle and the commonly accepted network protocol
layering model that packet networks have used for decades. Conventionally,
network devices are only supposed to process packets up
to the network layer and leave the upper layers (i.e., the transport
layer and the application layer) intact for the end hosts to process.
INC, however, requires the network devices to participate in the
application logic, so they inevitably need to process the related
packets up to the application layer, as shown in <xref target="stack" />.</t>
      
      
	   <figure anchor="stack">
        <name>Network Protocol Stack in INC</name>
	
        <artwork align="center" name="" type="" alt=""><![CDATA[
		
                  /-------------------\ 
                 /     INC devices     \
+-----------+   /     +-----------+     \    +-----------+  
|application|   |     |application|     |    |application|
+-----------+   |     +-----------+     |    +-----------+
| transport |   |     | transport |     |    | transport |
+-----------+   |     +-----------+     |    +-----------+
|  network  |<--+---->|  network  |<----+--->|  network  |
+-----------+   \     +-----------+     /    +-----------+
   client        \---------------------/         server
                         network 
         ]]></artwork>
      </figure>

<t>Although such an architectural deviation introduces some complexity to the network system, the significant benefits
to the applications make the effort worthwhile,
as long as we limit the use to the applications that benefit
and confine the scope to a limited network domain
(e.g., a data center network).</t>
	  
<t>The computing functions need to be performed in the data plane
fast path. If a network device needs to direct the application
packets to the slow path (e.g., a local CPU or a remote
server) for processing, that is no longer INC within the
scope of this draft (and its rationale becomes questionable
in this case). Programmable data plane devices use different
programming languages (e.g., P4 and HDL) and have different
chip architectures (e.g., RMT pipeline, RTC, and FPGA).
These devices are optimized for simple packet processing and
forwarding with limited hardware resources. Specifically, it is
difficult for the devices to support complex stateful operations
and mathematical calculations beyond integer addition and
shift. Unsurprisingly, the in-network computing functions for the
supported applications are all relatively simple (e.g., resorting
to lookup tables or counters). However, programmable
switch chip technology is also progressing fast, with better
stateful operation support and computing capabilities.
It is conceivable that future programmable switches could
undertake more computing tasks, albeit still in a facilitating
role.</t>

<t>To correctly handle the computing tasks, however, a reliable
transport layer must be present. The transport layer provides
common services such as connection maintenance, reliability, flow control, and multiplexing. The existing INC
applications either make oversimplified assumptions to sidestep
this problem (e.g., they assume the use of UDP as the transport
layer protocol or ignore the transport layer altogether) or provide an ad hoc solution
dedicated to a particular application, which entangles the transport and application functions (e.g., ATP).
A general transport layer protocol is needed for INC to take
care of the common transport issues. It can free application
developers from worrying about the transport issues and help
them focus on the application logic itself.</t>

<t>
This draft provides background on a suite of RPC-based applications that can take advantage of INC support, surveys the existing transport protocols to show that they are insufficient or ill-suited for this context, and lays out the requirements for a general transport protocol tailored to such applications. The purpose of this draft is to help readers understand the problem domain and to inspire the design and development of a unified INC transport protocol.
</t>

    </section>


<section anchor="classification" numbered="true" toc="default">
      <name>INC Application Classification</name>

<t>The INC applications considered in this draft all follow the communication paradigm of Remote Procedure Call
(RPC): a client sends a message with arguments to a server
and gets back a response that reflects the result of a computation
based on the arguments. On the one hand, this paradigm differs from that of TCP,
which is mainly used for transferring byte streams; on the
other hand, it requires a reliable datagram service that goes beyond
what UDP can support.
</t>	  
	  
<t>We can classify these INC applications into three service
models:</t>

<t><list style="hanging">  

<t hangText="Synchronous Collaboration (SC):"> Each client in a set
sends a piece of data to a server at roughly the
same time. The result can be computed and sent back to the clients once all the data
pieces are received. A notable example is AllReduce.</t>

<t hangText="Asynchronous Collaboration (AC):"> Each client in a set
sends multiple data items to a server. The result can
be computed once all the data items are received. An
example of such an application is MapReduce.</t>


<t hangText="Individual Request (IR):"> A client sends individual requests
to a server and gets a response for each request. An
example of such an application is NetCache.</t>

</list></t>
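
<t>As an informal illustration of the SC model, the following Python sketch
mimics an aggregation point that sums integer vectors and releases the result
once every expected client has contributed. It is only a sketch under stated
assumptions: the names SyncAggregator and submit are hypothetical, the client
set is assumed to be known in advance, and nothing here is part of a protocol
defined by this draft. Only integer addition is used, in line with the limited
arithmetic available in programmable network devices.</t>

<figure><artwork align="left" name="" type="" alt=""><![CDATA[
```python
from typing import List, Optional, Set


class SyncAggregator:
    """Hypothetical SC aggregation point (illustration only)."""

    def __init__(self, client_ids: List[str], vector_len: int) -> None:
        self.expected: Set[str] = set(client_ids)  # assumed known
        self.seen: Set[str] = set()
        self.total: List[int] = [0] * vector_len

    def submit(self, client_id: str,
               data: List[int]) -> Optional[List[int]]:
        """Add one client's piece.

        Returns the element-wise sum once all expected clients have
        contributed, otherwise None.
        """
        if client_id not in self.expected:
            raise KeyError(client_id)
        if len(data) != len(self.total):
            raise ValueError("vector length mismatch")
        if client_id not in self.seen:
            # A repeated piece from the same client is counted once.
            self.seen.add(client_id)
            self.total = [a + b for a, b in zip(self.total, data)]
        if self.seen == self.expected:
            return list(self.total)
        return None
```
]]></artwork></figure>

<t>In this sketch, a call to submit returns None until the last expected
client has contributed; the same call pattern would apply whether the
aggregation point is a network device or a server.</t>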


<t>From a different perspective, we can observe that there are three basic communication modes depending on the applications, as shown
in <xref target="inc-comm-mode" />:</t>

<t><list style="hanging"> 

<t hangText="Device Only Mode (DO):"> The INC network devices alone can completely finish a computing task.
Therefore, a client can choose to send a task to the INC network devices instead of a server, and the final result is returned directly to
the client from the INC network devices. </t>

<t hangText="Device+Server Mode (DS):"> the INC network devices can only partially finish a computing task and the intermediate result still needs to be sent to a server to finalize. The final result must be returned to the client from a server. </t> 

<t hangText="Hybrid Mode (HM):"> The INC network devices may or may not finish a computing task; therefore, the final result may be returned either by the INC network devices or by a server. </t> 

</list></t>

<t>
Each mode has its dominant benefit: DO mainly aims
to reduce latency, and DS mainly aims to reduce
traffic bandwidth and server load. HM may achieve both benefits, albeit with more implementation complexity. 
</t>



	  <figure anchor="inc-comm-mode">
        <name>In-Network Computing Communication Modes</name>
        <artwork align="center" name="" type="" alt=""><![CDATA[
	        
                   +-------+            
+------+         +-------+ |        +------+  
|      |         |network| |        |      |
|client|<------->|devices| |        |server|
|      |         |       |-+        |      |
+--^---+         +-------+          +---^--+
   |                                    |
   +------------------------------------+                  
               Device Only Mode (DO)
			   
                   +-------+
+------+         +-------+ |        +------+  
|      |         |network| |        |      |
|client+-------->|devices+-+------->|server|
|      |         |       |-+        |      |
+--^---+         +-------+          +--+---+
   |                                   |
   +-----------------------------------+         
              Device+Server Mode (DS)
         
                   +-------+
+------+         +-------+ |        +------+  
|      |         |network| |        |      |
|client+-------->|devices+.........>|server|
|      |<--------|       |-+        |      |
+--^---+         +-------+          +--.---+
   :                                   :
   .....................................        
              Hybrid Mode (HM)
			  
         ]]></artwork>
      </figure>


<t><xref target="model"/> shows the dominant combinations of service
model and communication mode. Because AC may require more
resources than a network device can provide,
it is less used with the DO mode. IR usually aims to optimize
the response latency, so the DS mode is less helpful, yet HM may provide a fallback mechanism for
requests that the network devices cannot satisfy.</t>


	   <figure anchor="model">
        <name>Service Model and Communication Mode</name>
	
        <artwork align="center" name="" type="" alt=""><![CDATA[
+-----------------------+-----+-----+-----+
|                       | DO  | DS  | HM  |
+-----------------------+-----+-----+-----+   		
|Sync Collaboration(SC) |  x  |  x  |  x  |
+-----------------------+-----+-----+-----+
|Async Collaboration(AC)|     |  x  |     |
+-----------------------+-----+-----+-----+
|Individual Request(IR) |  x  |     |  x  |
+-----------------------+-----+-----+-----+
         ]]></artwork>
      </figure>
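
<t>The matrix in <xref target="model"/> and the mode definitions above can also
be written down as a small lookup structure. The Python sketch below is only
an illustration: the names ServiceModel, CommMode, DOMINANT, is_dominant, and
result_sources are hypothetical and are not defined by this draft.</t>

<figure><artwork align="left" name="" type="" alt=""><![CDATA[
```python
from enum import Enum
from typing import Set


class ServiceModel(Enum):
    SC = "Synchronous Collaboration"
    AC = "Asynchronous Collaboration"
    IR = "Individual Request"


class CommMode(Enum):
    DO = "Device Only"
    DS = "Device+Server"
    HM = "Hybrid"


# Dominant combinations, transcribed from the matrix figure.
DOMINANT = {
    ServiceModel.SC: {CommMode.DO, CommMode.DS, CommMode.HM},
    ServiceModel.AC: {CommMode.DS},
    ServiceModel.IR: {CommMode.DO, CommMode.HM},
}


def is_dominant(model: ServiceModel, mode: CommMode) -> bool:
    """True if the matrix marks this combination with an 'x'."""
    return mode in DOMINANT[model]


def result_sources(mode: CommMode) -> Set[str]:
    """Who may return the final result to the client in each mode."""
    if mode is CommMode.DO:
        return {"device"}
    if mode is CommMode.DS:
        return {"server"}
    return {"device", "server"}  # HM: either one may reply
```
]]></artwork></figure>

<t>For instance, is_dominant(ServiceModel.AC, CommMode.DO) evaluates to False,
reflecting the observation that AC is less used with the DO mode.</t>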


</section>



<section anchor="survey" numbered="true" toc="default">
	<name>Existing Transport Protocols</name>
	
	<t>We argue that the existing transport protocols are not
suitable for INC.</t>

    <t><list style="hanging">
			<t hangText="TCP:"> As the most widely used transport protocol, TCP (as well as
its variants such as DCTCP and MPTCP) is ruled out because
of its end-to-end streaming semantics. Any mutation of the
TCP packet payloads is considered a break in the stream, but the
INC applications that require network device collaboration
do need to modify the packet payload. Also, any dropped
packet in a TCP stream sensed by the receiver must be retransmitted;
this rules out INC applications in which a network device terminates a packet and returns the computing result directly. While
it is theoretically possible to make the network device maintain
two separate TCP connections with the two communicating
end hosts, the cost of implementation is prohibitively large.
More issues with TCP in data centers can be found in <xref target="homa" />.</t>

			<t hangText="UDP:"> As another common transport protocol, UDP is unreliable
and lacks mechanisms for flow control. Some previous INC
applications assume the use of UDP as the transport layer
for simplicity, but this provisional measure can neither meet
production-level requirements nor provide enough transport
layer support for all the INC applications concerned.</t>

			<t hangText="QUIC:"> QUIC works for the RPC style of communication.
However, it is designed for wide area networks, and part of the
packet header and the payload are encrypted, which prevents
application layer packet processing in network devices.
Even without the encryption, the QUIC header information is
not enough to support the INC applications. </t>

			<t hangText="MTP:"> <xref target="mtp">MTP</xref> is the first transport protocol dedicated to INC.
It captures some core requirements for INC and is open to
different congestion control algorithms. However, it is inspired by
pathlet routing and mainly focuses on pathlet-based
congestion control support. It lacks efficient support for all
the application types described above.</t>

			<t hangText="RDMA:"> RDMA allows two end hosts to exchange data quickly. Whether
it is supported natively (i.e., InfiniBand) or carried over UDP
or TCP, it requires in-order and immutable transport, which
poses challenges for INC applications similar to those of TCP.</t>

			<t hangText="HOMA:"> <xref target="homa">HOMA</xref> is proposed as a transport protocol to replace TCP
in data centers. However, HOMA is not designed with
INC in mind either.</t>

			<t hangText="Ad Hoc Protocols:"> Several INC applications (e.g., ATP and ASK) provide
a customized transport layer. However, these protocols only
work for a particular application. Moreover, they lack a
clear separation between the transport layer and the application
layer. Some application layer functions leak into the transport
layer, further limiting their generality. </t>

</list></t>

</section>


<section anchor="requirement" numbered="true" toc="default">
      <name>Requirements</name>
	  
	  
<t>The premise of the end-to-end (E2E) principle is that it is more costly
to guarantee a given level of reliability by relying on the network
than by relying on the end hosts. INC introduces multiple end
points in the communication, with one of them residing in the
network, effectively changing the communication paradigm from E2E to E2I2E (where I denotes the intermediate nodes that perform transport layer functions). Therefore, we need to revisit the E2E principle to
see whether we can break it or adapt to it in the new context.
We can observe several properties of the INC applications
covered here.</t>

 <t><list style="symbol">

<t>These applications are usually applied in a limited network scale with topology regularity (e.g., a data center
network or an access network).</t>
<t>The network is under the control of a single administrative entity.</t>
<t>The benefit-cost ratio is significant enough to warrant
the INC deployment, which usually requires
software-hardware and host-device co-design.</t>
<t>Multiple applications with the same or different service
models, or multiple jobs for the same application, can be
active at the same time.</t>

</list></t>

<t>Based on these observations, a new transport layer protocol
for INC in support of RPC-based applications can be designed. The protocol
only works in a limited domain, and it virtualizes the network
as a single logical middle point. That is, if multiple network
devices collaborate on a computing task, they are considered
as one device. Packet forwarding among these devices needs
to be handled by the network layer using techniques such as
Segment Routing (SR) and Service Function Chaining (SFC).</t>
	  
      
<t>From the previous discussion, we lay out the design requirements of a transport protocol dedicated to INC:</t>

        <t><list style="hanging">
			<t hangText="Simplicity:"> Due to the limited resources and capabilities of the
programmable network devices, the transport layer functions
in them cannot be complex. For example, per-flow state
machines and congestion control algorithms are difficult to
implement in the programmable network devices. The protocol should aim to leave the complexity to the
end hosts and require only simple processing in the programmable network devices. </t>
			<t hangText="Generality:"> The different service models and communication modes should all be supported. The protocol should also be independent of the underlying network layer protocol. </t> 
            <t hangText="Openness:"> Since the performance requirements of the applications may vary, the flow control and reliability mechanism of the protocol should be open to different algorithms.</t>	
			<t hangText="Compatibility:"> The protocol should be able to coexist with the other transport protocols.</t>
	    </list></t>
      
</section>

  <section anchor="IANA" numbered="true" toc="default">
    <name>IANA Considerations</name> 
      <t>This document includes no request to IANA.</t>
  </section>
  
  <section anchor="Security" numbered="true" toc="default">
    <name>Security Considerations</name>  
      <t>tbd</t>
  </section>
    
  </middle>

<back>
    
      <references title="Normative References">
	    <?rfc include='reference.RFC.2119'?> 	
      </references>
	  
      <references title="Informative References">
	
	    <reference anchor="homa" target="http://dx.doi.org/10.48550/arXiv.2210.00714">
			<front>
				<title>It's Time to Replace TCP in the Datacenter</title>
				<author initials="J." surname="Ousterhout"/>
				<date year="2023"/>
			</front>
		</reference>
	
		<reference anchor="mtp" target="http://dx.doi.org/10.1145/3484266.3487382">
			<front>
				<title>TCP is Harmful to In-Network Computing: Designing a Message Transport Protocol (MTP)</title>
				<author initials="B." surname="Stephens"/>
				<author initials="D." surname="Grassi"/>
				<author initials="H." surname="Almasi"/>
				<author initials="T." surname="Ji"/>
				<author initials="B." surname="Vamanan"/>
				<author initials="A." surname="Akella"/>
				<date year="2021"/>
			</front>
		</reference>

      </references>
    
  </back>
  
</rfc>
