<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc SYSTEM "rfc2629-xhtml.ent">
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt"?>

<rfc
 category="std"
 docName="draft-ietf-nfsv4-rpcrdma-version-two-07"
 indexInclude="true"
 ipr="pre5378Trust200902"
 obsoletes=""
 scripts="Common,Latin"
 sortRefs="true"
 submissionType="IETF"
 symRefs="true"
 tocDepth="2"
 tocInclude="true"
 updates=""
 version="3"
 xml:lang="en">

<front>

<title abbrev="RPC-over-RDMA Version 2">
RPC-over-RDMA Version 2 Protocol
</title>

<seriesInfo name="Internet-Draft" value="draft-ietf-nfsv4-rpcrdma-version-two-07"/>

<author initials="C." surname="Lever" fullname="Charles Lever" role="editor">
<organization abbrev="Oracle" showOnFrontPage="true">Oracle Corporation</organization>
<address>
<postal>
<street/>
<city/>
<region/>
<code/>
<country>United States of America</country>
</postal>
<email>chuck.lever@oracle.com</email>
</address>
</author>

<author initials="D." surname="Noveck" fullname="David Noveck">
<organization showOnFrontPage="true">NetApp</organization>
<address>
<postal>
<street>1601 Trapelo Road</street>
<city>Waltham</city>
<region>MA</region>
<code>02451</code>
<country>United States of America</country>
</postal>
<phone>+1 781 572 8038</phone>
<email>davenoveck@gmail.com</email>
</address>
</author>

<date/>

<area>Transport</area>
<workgroup>Network File System Version 4</workgroup>
<keyword>NFS-Over-RDMA</keyword>

<abstract>
<t>
This document specifies the second version
of a transport protocol that conveys
Remote Procedure Call (RPC) messages
using Remote Direct Memory Access (RDMA).
This version of the protocol is extensible.
</t>
</abstract>

<note removeInRFC="true">
<t>
Discussion of this draft takes place
on the NFSv4 working group mailing list (nfsv4@ietf.org),
which is archived at
<eref target="https://mailarchive.ietf.org/arch/browse/nfsv4/"/>.
Working Group information can be found at
<eref target="https://datatracker.ietf.org/wg/nfsv4/about/"/>.
</t>
<t>
The source for this draft is maintained in GitHub.
Suggested changes can be submitted as pull requests at
<eref target="https://github.com/chucklever/i-d-rpcrdma-version-two"/>.
Instructions are on that page as well.
</t>
</note>

</front>

<middle>

<section
 anchor="section_72f6ba4a-aafb-4e9d-8b87-800ebccc5879"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Introduction</name>
<t>
Remote Direct Memory Access (RDMA)
<xref target="RFC5040" format="default" sectionFormat="of"/>
<xref target="RFC5041" format="default" sectionFormat="of"/>
<xref target="IBA" format="default" sectionFormat="of"/>
is a technique for moving data efficiently between network nodes.
By placing transferred data directly into destination buffers
using Direct Memory Access, RDMA delivers the reciprocal benefits of
faster data transfer
and
reduced host CPU overhead.
</t>
<t>
Open Network Computing Remote Procedure Call
(ONC RPC, often shortened in NFSv4 documents to RPC)
<xref target="RFC5531" format="default" sectionFormat="of"/>
is a Remote Procedure Call protocol
that runs over a variety of transports.
Most RPC implementations today use
UDP
<xref target="RFC0768" format="default" sectionFormat="of"/>
or
TCP
<xref target="RFC0793" format="default" sectionFormat="of"/>.
On UDP, a datagram encapsulates each RPC message.
Within a TCP byte stream,
a record marking protocol delineates RPC messages.
</t>
<t>
An RDMA transport, too, conveys RPC messages
in a fashion that must be fully defined
if RPC implementations are to interoperate
when using RDMA to transport RPC transactions.
Although RDMA transports encapsulate messages like UDP,
they deliver them reliably and in order, like TCP.
Further, they implement a bulk data transfer service
not provided by traditional network transports.
Therefore, we treat RDMA as a novel transport type for RPC.
</t>

<section
 anchor="section_3ade56d8-45ea-4ab5-b97c-da817d3e0033"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Design Goals</name>

<t>
The general mission of RPC-over-RDMA transports is to
leverage network hardware capabilities to
reduce host CPU needs related to the transport of RPC messages.
In particular, this includes
mitigating host interrupt rates
and
limiting the necessity to copy RPC payload bytes on receivers.
</t>
<t>
These hardware capabilities benefit both RPC clients and servers.
On balance, however, the RPC-over-RDMA protocol design approach
has been to bolster clients more than servers, as the client is
typically where applications are most hungry for CPU resources.
</t>
<t>
Additionally,
RPC-over-RDMA transports are designed to
support RPC applications transparently.
However, such transports can also provide mechanisms
that enable further optimization of data transfer
when RPC applications are structured
to exploit direct data placement.
In this context, the Network File System (NFS) family of protocols
(as described in
<xref target="RFC1094" format="default" sectionFormat="of"/>,
<xref target="RFC1813" format="default" sectionFormat="of"/>,
<xref target="RFC7530" format="default" sectionFormat="of"/>,
<xref target="RFC7862" format="default" sectionFormat="of"/>,
<xref target="RFC8881" format="default" sectionFormat="of"/>,
and subsequent NFSv4 minor versions)
are all potential beneficiaries of RPC-over-RDMA.
</t>
<t>
A complete problem statement appears in
<xref target="RFC5532" format="default" sectionFormat="of"/>.
</t>
</section>

<section
 anchor="section_0a2befc3-b5d7-468e-a48e-97c46a9c1b40"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Motivation for a New Version</name>
<t>
Storage administrators have broadly deployed
the RPC-over-RDMA version 1 protocol specified in
<xref target="RFC8166" format="default" sectionFormat="of"/>.
However, there are known shortcomings to this protocol:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The protocol's default size of Receive buffers forces
the use of RDMA Read and Write transfers for small payloads,
and limits the size of reverse-direction messages.
</li>
<li>
It is difficult to make optimizations or protocol fixes
that require changes to on-the-wire behavior.
</li>
<li>
For some RPC procedures, the maximum reply size is
difficult or impossible for an RPC client to estimate
in advance.
</li>
</ul>
<t>
To address these issues in a way that preserves interoperation
with existing RPC-over-RDMA version 1 deployments,
the current document presents
an updated version of the RPC-over-RDMA transport protocol.
</t>
<t>
This version of RPC-over-RDMA is extensible,
enabling the introduction of <bcp14>OPTIONAL</bcp14> extensions
without impacting existing implementations.
See
<xref target="section_d945b9f0-0666-4db7-9126-be57cf7b5f4f" format="default" sectionFormat="of"/>
for further discussion.
It introduces a mechanism to exchange implementation properties
to automatically provide further optimization of data transfer.
</t>
<t>
This version also contains incremental changes that
relieve performance constraints
and
enable recovery from unusual corner cases.
These changes are outlined in
<xref target="section_c2574344-5aec-427d-a5ed-048d7fcc0d95" format="default" sectionFormat="of"/>
and include
a larger default inline threshold,
the ability to convey a single RPC message using multiple RDMA Send operations,
support for authentication of connection peers,
richer error reporting,
improved credit-based flow control,
and
support for Remote Invalidation.
</t>
</section>

</section>

<section
 anchor="section_ef1a2819-4d22-40af-8d38-fde10849c872"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Requirements Language</name>
<t>
The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>",
"<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>",
"<bcp14>SHALL NOT</bcp14>", "<bcp14>SHOULD</bcp14>",
"<bcp14>SHOULD NOT</bcp14>",
"<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
"<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>"
in this document are to be interpreted
as described in BCP&nbsp;14
<xref target="RFC2119" format="default" sectionFormat="of"/>
<xref target="RFC8174" format="default" sectionFormat="of"/>
when, and only when, they appear in all capitals, as shown here.
</t>
</section>

<section
 anchor="section_4dc39c9c-3770-491f-b674-f824e87e2143"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Terminology</name>

<section
 anchor="section_4d675e45-c377-48df-8029-c5c9f8c48f9f"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Remote Procedure Calls</name>
<t>
This section highlights critical elements of the RPC protocol
<xref target="RFC5531" format="default" sectionFormat="of"/>
and
the External Data Representation (XDR)
<xref target="RFC4506" format="default" sectionFormat="of"/>
it uses.
RPC-over-RDMA version 2 enables
the transmission of RPC messages built using XDR
and
also uses XDR internally to describe its header format.
</t>

<section
 anchor="section_8d804fe5-c7c7-4c6c-92d8-888da10caaec"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Upper-Layer Protocols</name>
<t>
RPCs are an abstraction used to implement the operations of an Upper-Layer Protocol (ULP).
For RPC-over-RDMA, "ULP" refers to an RPC Program and Version tuple,
which is a versioned set of procedure calls that comprise a single well-defined API.
One example of a ULP is the Network File System Version 4.0
<xref target="RFC7530" format="default" sectionFormat="of"/>.
In the current document, the term "RPC consumer" refers to
an implementation of a ULP running on an RPC client.
</t>
</section>

<section
 anchor="section_17a77782-8b11-4fb5-af0b-e0da7759c10A"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RPC Procedures</name>
<t>
Like a local procedure call,
every RPC procedure has a set of "arguments" and a set of "results".
A calling context invokes an RPC procedure,
passing arguments to it,
and the procedure subsequently returns a set of results.
Unlike a local procedure call,
an RPC procedure is executed remotely rather than
in the local application's execution context.
</t>
</section>

<section
 anchor="section_97382254-b1a3-4e03-98e5-a0814b331bd0"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RPC Transactions</name>
<t>
The RPC protocol as described in
<xref target="RFC5531" format="default" sectionFormat="of"/>
is fundamentally a message-passing protocol
between one or more clients, where RPC consumers are running,
and a server, where a remote execution context is available
to process RPC transactions on behalf of these consumers.
</t>
<t>
ONC RPC transactions consist of two types of messages:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
A CALL message, or "Call", requests work.
An RPC Call message is designated
by the value zero (0) in the message's msg_type field.
</li>
<li>
A REPLY message, or "Reply",
reports the results of work requested by an RPC Call message.
An RPC Reply message is designated
by the value one (1) in the message's msg_type field.
</li>
</ul>
<t>
<xref target="RFC5531" section="9" format="default" sectionFormat="of"/>
introduces the RPC transaction identifier,
or "XID" for short.
Each connection endpoint interprets the value of an XID
in the context of the message's msg_type field.
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The sender of a Call message generates an arbitrary XID value
for each RPC that is unique among outstanding Calls from that sender.
</li>
<li>
The sender of a Reply message copies the XID of the initiating Call
to the Reply containing the results of that procedure.
</li>
</ul>
<t>
After receiving a Reply,
a Requester then matches the XID value in that Reply
with a Call it previously sent.
</t>
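<t>
The XID rules above can be illustrated with a short, non-normative sketch.
It assumes a Requester that draws XIDs from a simple counter;
the class and method names (PendingCalls, send_call, match_reply)
are hypothetical and appear nowhere in the RPC specification.
</t>
<sourcecode type="python"><![CDATA[
```python
import itertools


class PendingCalls:
    """Tracks Calls awaiting a Reply, keyed by XID (hypothetical helper)."""

    def __init__(self, first_xid=1):
        self._next = itertools.count(first_xid)
        self._outstanding = {}

    def send_call(self, procedure):
        # Pick an XID not shared with any outstanding Call from this sender.
        xid = next(self._next) % 2**32
        while xid in self._outstanding:
            xid = next(self._next) % 2**32
        self._outstanding[xid] = procedure
        return xid

    def match_reply(self, xid):
        # Return the Call this Reply answers, or None if no Call matches.
        return self._outstanding.pop(xid, None)
```
]]></sourcecode>
<t>
A counter is only one possible way to generate XIDs;
the rules above require uniqueness among outstanding Calls
and nothing more.
</t>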
<t>
The ratio of Call messages to Reply messages is typically
but not always one-to-one.
</t>
<t>
The most common operational paradigm is when
a Requester sends a Call message to a Responder,
which then sends a Reply message back to the Requester
with the results of that procedure.
One Call message elicits a single Reply message in response.
A Responder never sends more than one Reply
for each received Call message.
</t>
<t>
A "retransmission" occurs when a Requester sends
exactly the same Call message,
with the same arguments and XID, more than once.
A Requester can retransmit if it believes the network layer
or Responder has dropped a Call message,
or if the Responder's Reply has been likewise lost.
To prevent unnecessary network traffic or the execution
of non-idempotent procedures multiple times,
Requesters avoid retransmitting needlessly.
</t>
<t>
In rare cases, an RPC procedure may not require any
results or even acknowledgement that the Responder
has executed the procedure.
In that case, the Requester sends a Call message
but no Reply is returned.
This document refers to that case as "Call-only".
</t>
</section>

<section
 anchor="section_7e64e30d-0519-449b-b0bf-45c3d103b0be"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Message Serialization</name>
<t>
RPC messages are always transmitted atomically.
RPC peers may interleave messages,
but the contents of individual messages
cannot be broken up or interleaved
without making the messages illegible.
</t>
<t>
An RPC peer acting as a "Requester"
serializes the procedure's arguments
and
conveys them to a "Responder" endpoint via an RPC Call message.
A Call message contains an RPC protocol header with a unique XID,
a header describing the requested upper-layer operation,
and all arguments.
</t>
<t>
An RPC peer acting as a "Responder"
deserializes these arguments and processes the requested procedure.
It then serializes the procedure's results into an RPC Reply message.
An RPC Reply message contains an RPC protocol header with the same XID,
a header describing the upper-layer reply,
and all results.
</t>
<t>
The Requester deserializes the results
and
allows the RPC consumer to proceed.
At this point, the RPC transaction
designated by the XID in the RPC Call message is complete,
and the XID is retired.
</t>
</section>

<section
 anchor="section_63f47fcd-629b-4b00-aa8b-dbf836401581"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RPC Transports</name>
<t>
The role of an "RPC transport" is
to mediate the exchange of RPC messages
between Requesters and Responders,
bridging the gap between
the RPC message abstraction
and
the native operations of a network transport
(e.g., a socket).
</t>
<t>
When an RPC transport type is connection-oriented,
RPC client endpoints initiate transport connections,
while RPC server endpoints wait passively to accept incoming connection requests.
RPC messages may also be exchanged without a connection association.
Because RPC-over-RDMA is a connection-oriented RPC transport,
connectionless operation is not discussed further in the current document.
</t>

<section
 anchor="section_7aafb376-73d8-4fa1-8888-97a02c9a58c1"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Transport Failure Recovery</name>
<t>
So that appropriate and timely recovery action can be taken,
the transport implementation is responsible for notifying
a Requester when an RPC Call or Reply was not able to be conveyed.
Recovery can take the form of establishing a new connection,
re-sending RPC Calls, or terminating RPC transactions pending
on the Requester.
</t>
<t>
For instance, a connection loss may occur after a Responder
has received an RPC Call but before it can send the matching RPC Reply.
Once the transport notifies the Requester of the connection loss,
the Requester can re-send all pending RPC Calls on a fresh connection.
</t>
</section>

<section
 anchor="section_2432566f-67e7-4f35-8ec4-9ed44cecd8cc"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Forward Direction</name>
<t>
Traditionally, an RPC client acts as a Requester,
while an RPC service acts as a Responder.
The current document
refers to this direction of RPC message passing
as "forward-direction" operation.
</t>
</section>

<section
 anchor="section_189a59d0-9235-4c1e-a76f-ea2b20fd6c94"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Reverse-Direction</name>
<t>
The RPC specification
<xref target="RFC5531" format="default" sectionFormat="of"/>
does not forbid performing RPC transactions
in the other direction.
An RPC service endpoint can act as a Requester,
in which case an RPC client endpoint acts as a Responder.
This direction of RPC message passing is known as
"reverse-direction" operation.
</t>
<t>
During reverse-direction operation,
an RPC client is responsible
for establishing transport connections,
even though the RPC server originates RPC Calls.
</t>
<t>
RPC clients and servers are usually optimized
to perform and scale well when handling traffic
in the forward direction.
They might not be prepared to handle operation
in the reverse direction.
Not until NFS version 4.1
<xref target="RFC8881" format="default" sectionFormat="of"/>
has there been a strong need
to handle reverse-direction operation.
</t>
</section>

<section
 anchor="section_05f24e3b-ad49-4370-a0fe-477b0f1364aa"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Bi-directional Operation</name>
<t>
A pair of connected RPC endpoints may choose to use
only forward-direction
or
only reverse-direction operation
on a particular transport connection.
Or, these endpoints may send Calls
in both directions concurrently
on the same transport connection.
</t>
<t>
"Bi-directional operation" occurs when both transport endpoints
act as a Requester and a Responder at the same time
on a single connection.
</t>
<t>
Bi-directionality is an extension
of RPC transport connection sharing.
Two RPC endpoints wish to exchange
independent RPC messages over a shared connection
but in opposite directions.
These messages may or may not be related
to the same workloads or RPC Programs.
</t>
<t>
During bi-directional operation,
forward- and reverse-direction XIDs
are typically generated
on distinct hosts by possibly different algorithms.
There is no coordination between the generation of XIDs
used in forward-direction and reverse-direction operation.
</t>
<t>
Therefore, a forward-direction Requester
<bcp14>MAY</bcp14> use the same XID value at the same time
as a reverse-direction Requester
on the same transport connection.
Although such concurrent requests use the same XID value,
they represent distinct RPC transactions.
</t>
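<t>
As a non-normative illustration of why identical XID values
do not collide during bi-directional operation,
the sketch below keeps separate tables for Calls an endpoint has sent
and Calls it has received, and uses the msg_type field to choose between them.
The Endpoint class and its methods are hypothetical.
</t>
<sourcecode type="python"><![CDATA[
```python
CALL, REPLY = 0, 1  # msg_type values from the RPC protocol


class Endpoint:
    """One end of a bi-directional connection (hypothetical model)."""

    def __init__(self):
        self.sent_calls = {}      # XIDs this endpoint generated
        self.received_calls = {}  # XIDs the peer generated

    def on_send_call(self, xid, what):
        self.sent_calls[xid] = what

    def on_receive(self, msg_type, xid, body):
        if msg_type == CALL:
            # A Call from the peer: its XID lives in the peer's XID space.
            self.received_calls[xid] = body
            return ("serve", body)
        # A Reply: match it only against Calls this endpoint sent.
        return ("complete", self.sent_calls.pop(xid))
```
]]></sourcecode>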
</section>

</section>

<section
 anchor="section_98bdc62c-0af4-4379-8b5c-6d98b7a520c7"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>External Data Representation</name>
<t>
One cannot assume that all Requesters and Responders
represent data objects in the same way internally.
RPC uses External Data Representation (XDR)
to translate native data types and serialize arguments and results
<xref target="RFC4506" format="default" sectionFormat="of"/>.
</t>
<t>
XDR encodes data independently
of the endianness or size of host-native data types,
enabling unambiguous decoding of data by a receiver.
</t>
<t>
XDR assumes only that the number of bits in a byte (octet)
and
their order are the same on both endpoints and the physical network.
The smallest indivisible unit of XDR encoding is a group of four octets.
XDR can also flatten
lists,
arrays,
and
other complex data types
into a stream of bytes.
</t>
<t>
We refer to a serialized stream of bytes
that is the result of XDR encoding
as an "XDR stream".
A sender encodes native data
into an XDR stream and then transmits that stream to a receiver.
The receiver decodes incoming XDR byte streams
into its native data representation format.
</t>
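<t>
The following non-normative sketch shows the encode and decode steps
for a single XDR unsigned integer, which occupies four octets
with the most significant octet first.
The function names are hypothetical.
</t>
<sourcecode type="python"><![CDATA[
```python
import struct


def encode_uint(value):
    # XDR unsigned integer: four octets, most significant octet first.
    return struct.pack(">I", value)


def decode_uint(stream, offset=0):
    # Returns the decoded value and the offset of the next item.
    (value,) = struct.unpack_from(">I", stream, offset)
    return value, offset + 4
```
]]></sourcecode>
<t>
Because the octet order is fixed by the encoding,
the same stream decodes to the same value on any host,
whatever its native endianness.
</t>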

<section
 anchor="section_c6d3092c-99e6-4cce-b377-fffc4862929F"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>XDR Opaque Data</name>
<t>
Sometimes, a data item is to be transferred as-is,
without encoding or decoding.
We refer to the contents of such a data item as "opaque data".
XDR encoding places the content of opaque data items
directly into an XDR stream without altering it in any way.
ULPs or applications perform
any needed data translation in this case.
Examples of opaque data items include the content of files
or generic byte strings.
</t>
</section>

<section
 anchor="section_c210323f-c524-4e98-a02d-23549a4bebc5"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>XDR Roundup</name>
<t>
The number of octets in a variable-length data item
precedes that item in an XDR stream.
If the size of an encoded data item is not a multiple of four octets,
the sender appends octets containing zero after the end of the data item.
These zero octets shift the next encoded data item in the XDR stream
so that it always starts on a four-octet boundary.
The addition of extra octets does not change
the encoded size of the data item.
Receivers do not expose the extra octets to ULPs.
</t>
<t>
We refer to this technique as "XDR roundup",
and the extra octets as "XDR roundup padding".
</t>
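<t>
XDR roundup can be sketched as follows for a variable-length opaque item.
This is a non-normative illustration with hypothetical function names;
note that the length word counts only the data octets.
</t>
<sourcecode type="python"><![CDATA[
```python
import struct


def encode_opaque(data):
    # Length word counts only the data octets, not the padding.
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def decode_opaque(stream, offset=0):
    (length,) = struct.unpack_from(">I", stream, offset)
    start = offset + 4
    data = stream[start:start + length]
    # Skip the roundup padding; it is never handed to the ULP.
    next_offset = start + length + (4 - length % 4) % 4
    return data, next_offset
```
]]></sourcecode>
<t>
For example, a five-octet item is followed by three octets of
XDR roundup padding, so the next item starts twelve octets
after the start of the length word.
</t>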
</section>

</section>

</section>

<section
 anchor="section_de830270-64ed-4510-ac25-29837d352031"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Remote Direct Memory Access</name>
<t>
When a third party transfers large RPC payloads,
RPC Requesters and Responders can become more efficient.
An example of such a third party might be
an intelligent network interface
(data movement offload),
which places data in the receiver's memory so that
no additional adjustment of data alignment is necessary
(direct data placement or "DDP").
RDMA transports enable both of these optimizations.
</t>
<t>
In the current document, the standalone term "RDMA" refers to
the physical mechanism an RDMA transport utilizes when moving data.
</t>

<section
 anchor="section_1b97ecfd-7aba-4299-9007-dab28ac76f81"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Direct Data Placement</name>
<t>
Typically, RPC implementations copy
the contents of RPC messages into a buffer before sending them.
An efficient RPC implementation sends bulk data
without first copying it into a separate send buffer.
</t>
<t>
However, socket-based RPC implementations
are often unable to receive data directly
into its final place in memory.
Receivers often need to copy incoming data
to finish an RPC operation,
if only to adjust data alignment.
</t>
<t>
Although it may not be efficient,
before an RDMA transfer, a sender may copy data into an intermediate buffer.
After an RDMA transfer, a receiver may copy that data again to its final destination.
In this document, the term "DDP" refers to
any optimized data transfer where a receiving host's CPU
does not move transferred data
to another location after arrival.
</t>
<t>
RPC-over-RDMA version 2 enables the use of RDMA Read and Write operations
to achieve both data movement offload and DDP.
However, note that
not all RDMA-based data transfer qualifies as DDP,
and
some mechanisms that do not employ explicit RDMA can place data directly.
</t>
</section>

<section
 anchor="section_6903045e-bd1c-4e12-bf96-6b534989f46A"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA Transport Operation</name>
<t>
RDMA transports require that
RDMA consumers provision resources in advance
to achieve good performance during receive operations.
An RDMA consumer might provide Receive buffers in advance
by posting an RDMA Receive Work Request
for every expected RDMA Send from a remote peer.
These buffers are provided
before the remote peer posts RDMA Send Work Requests.
Thus, this practice is often referred to as "pre-posting" buffers.
</t>
<t>
An RDMA Receive Work Request remains outstanding
until the RDMA provider matches it to an inbound Send operation.
The resources associated with that Receive must be retained in
host memory, or "pinned", until the Receive completes.
</t>
<t>
Given these tenets of operation,
the RPC-over-RDMA version 2 protocol assumes
each transport provides the following abstract operations.
A more complete discussion of these operations appears in
<xref target="RFC5040" format="default" sectionFormat="of"/>.
</t>

<section
 anchor="section_90f88ba5-5ad6-4ac1-b40d-ed9247e61ca5"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Memory Registration</name>
<t>
Memory registration assigns a steering tag
to a region of memory,
permitting the RDMA provider
to perform data-transfer operations.
The RPC-over-RDMA version 2 protocol assumes that
a steering tag of no more than 32 bits and memory
addresses of up to 64 bits in length
identify each registered memory region.
</t>
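<t>
The size assumptions above can be made concrete with a non-normative sketch
of a region descriptor.
The Region class, the 32-bit length field, and the field order used by
encode() are assumptions made for illustration only;
they are not taken from this section.
</t>
<sourcecode type="python"><![CDATA[
```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A registered memory region as seen by a remote peer (hypothetical)."""
    stag: int     # steering tag, at most 32 bits
    offset: int   # memory address, at most 64 bits
    length: int   # region length in octets (32-bit field assumed here)

    def __post_init__(self):
        if not 0 <= self.stag < 2**32:
            raise ValueError("steering tag must fit in 32 bits")
        if not 0 <= self.offset < 2**64:
            raise ValueError("address must fit in 64 bits")
        if not 0 <= self.length < 2**32:
            raise ValueError("length must fit in 32 bits")

    def encode(self):
        # Assumed field order: tag, length, address; all big-endian.
        return struct.pack(">IIQ", self.stag, self.length, self.offset)
```
]]></sourcecode>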
</section>

<section
 anchor="section_07bba55f-c48f-474c-918b-db6c9d2325dd"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA Send</name>
<t>
The RDMA provider supports an RDMA Send operation,
with completion signaled on the receiving peer
after the RDMA provider has placed data in a pre-posted buffer.
Sends complete at the receiver
in the order they were posted at the sender.
The size of the remote peer's pre-posted buffers
limits the amount of data
that can be transferred by a single RDMA Send operation.
</t>
</section>

<section
 anchor="section_9be6a44c-1ea5-4ccd-b188-ee04e930497b"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA Receive</name>
<t>
The RDMA provider supports an RDMA Receive operation
to receive data conveyed by incoming RDMA Send operations.
To reduce the amount of memory that must remain pinned
awaiting incoming Sends,
the amount of memory posted per Receive is limited.
The RDMA consumer (in this case, the RPC-over-RDMA version 2 protocol)
provides flow control to prevent overrunning receiver resources.
</t>
</section>

<section
 anchor="section_cfd79bce-e8e9-4a51-b43a-b747af6213f4"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA Write</name>
<t>
The RDMA provider supports an RDMA Write operation
to place data directly into a remote memory region.
The local host initiates an RDMA Write
and the RDMA provider signals completion there.
The remote RDMA provider does not signal completion
on the remote peer.
The local host provides
the steering tag,
the memory address,
and
the length of the remote peer's memory region.
</t>
<t>
RDMA Writes are not ordered relative to one another,
but are ordered relative to RDMA Sends.
Thus, a subsequent RDMA Send completion
signaled on the local peer
guarantees that prior RDMA Write data
has been successfully placed in the remote peer's memory.
</t>
</section>

<section
 anchor="section_f37121af-49ff-4575-a699-7310f4ae1296"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA Read</name>
<t>
The RDMA provider supports an RDMA Read operation
to place remote source data directly into local memory.
The local host initiates an RDMA Read
and the RDMA provider signals completion there.
The remote RDMA provider does not signal
completion on the remote peer.
The local host provides
the steering tags,
the memory addresses,
and the lengths for the remote source
and
local destination memory regions.
</t>
<t>
The RDMA consumer (in this case, the RPC-over-RDMA version 2 protocol)
signals Read completion to the remote peer
as part of a subsequent RDMA Send message.
The remote peer can then invalidate steering tags
and
subsequently free associated source memory regions.
</t>
</section>

</section>

</section>

</section>

<section
 anchor="section_5ae4b016-9b44-4649-9021-5ae851ac9326"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RPC-over-RDMA Framework</name>
<t>
Before an RDMA data transfer can occur,
an endpoint first exposes regions of its memory to a remote endpoint.
The remote endpoint then initiates RDMA Read and Write operations
against the exposed memory.
A "transfer model" designates
which endpoint exposes its memory
and
which is responsible for initiating the transfer of data.
</t>
<t>
In RPC-over-RDMA version 2,
only Requesters expose their memory to the Responder,
and only Responders initiate RDMA Read and Write operations.
Read access to memory regions enables the Responder to
pull RPC arguments
or
whole RPC Calls from each Requester.
The Responder pushes
RPC results
or
whole RPC Replies to a Requester's
memory regions to which it has write access.
</t>

<section
 anchor="section_195e0288-862d-40bb-a259-4239930c728a"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Message Framing</name>
<t>
Each RPC-over-RDMA version 2 message consists of at most two XDR streams:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The "Transport stream" contains a header that describes
and controls the transfer of the Payload stream
in this RPC-over-RDMA message.
Every RDMA Send on an RPC-over-RDMA version 2 connection
<bcp14>MUST</bcp14> begin with a Transport stream.
</li>
<li>
The "Payload stream" contains part or all of a single RPC message.
The sender <bcp14>MAY</bcp14> divide an RPC message at any convenient boundary
but
<bcp14>MUST</bcp14> send RPC message fragments in XDR stream order
and
<bcp14>MUST NOT</bcp14> interleave Payload streams from multiple RPC messages.
</li>
</ul>
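<t>
The Payload stream rules above can be sketched as follows.
This non-normative example omits the Transport stream that begins every RDMA Send
and shows only how a sender might divide RPC messages
while preserving stream order and avoiding interleaving.
The function names and the max_payload parameter are hypothetical.
</t>
<sourcecode type="python"><![CDATA[
```python
def fragment(rpc_message, max_payload):
    """Split one RPC message into Payload stream pieces, in stream order."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [rpc_message[i:i + max_payload]
            for i in range(0, len(rpc_message), max_payload)]


def send_queue(rpc_messages, max_payload):
    # Emit every fragment of one message before starting the next,
    # so Payload streams from different RPC messages never interleave.
    queue = []
    for index, message in enumerate(rpc_messages):
        for piece in fragment(message, max_payload):
            queue.append((index, piece))
    return queue
```
]]></sourcecode>
<t>
Splitting at a fixed size is merely one convenient boundary;
a sender is free to choose others.
</t>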
<t>
The RPC-over-RDMA framing mechanism described in this section
replaces all other RPC framing mechanisms.
Connection peers use RPC-over-RDMA framing
even when the underlying RDMA protocol runs
on a transport type with well-defined RPC framing, such as TCP.
However, a ULP can negotiate the use of RDMA,
dynamically enabling the use of RPC-over-RDMA on a connection
established on some other transport type.
Because RPC framing delimits an entire RPC request or reply,
the resulting shift in framing must occur between distinct RPC messages,
and in concert with the underlying transport.
</t>
</section>

<section
 anchor="section_130ce79c-8b13-479e-8108-a943024047dD"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Reliable Message Delivery</name>
<t>
RPC-over-RDMA provides
a reliable
and
in-order
data transport service for RPC Calls and Replies.
</t>
<t>
RPC-over-RDMA transports
<bcp14>MUST</bcp14>
operate only on a reliable Queue Pair (QP) such as
the RDMA RC (Reliable Connected) QP type
as defined in Section&nbsp;9.7.7 of
<xref target="IBA" format="default" sectionFormat="of"/>.
The Marker PDU Aligned (MPA) protocol
<xref target="RFC5044" format="default" sectionFormat="of"/>,
when deployed on a reliable transport such as TCP,
provides similar functionality.
Using a reliable QP type ensures
in-transit data integrity
and
proper recovery from packet loss in the lower layers.
</t>
<t>
If any pre-posted Receive buffer on the connection
is not large enough to contain an incoming message,
the receiving RDMA provider
cannot deliver that message to the upper-layer consumer.
Likewise, if no pre-posted Receive buffer is available
to accept an incoming message,
the receiving RDMA provider
cannot pass that message to the consumer.
Exceeding these limits results in
a transition to a QP error state,
the loss of an in-flight message,
and
the potential loss of the connection.
</t>
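<t>
The two failure conditions above can be modeled with a short, non-normative sketch.
The ReceiveQueue class is a hypothetical stand-in for an RDMA provider;
real providers expose different interfaces and richer error reporting.
</t>
<sourcecode type="python"><![CDATA[
```python
from collections import deque


class ReceiveQueue:
    """Models pre-posted Receive buffers on one connection (hypothetical)."""

    def __init__(self):
        self.posted = deque()   # sizes of pre-posted buffers, oldest first
        self.in_error = False

    def post_receive(self, size):
        self.posted.append(size)

    def deliver(self, message):
        # Returns True if the message reaches the consumer.
        if self.in_error:
            return False
        if not self.posted or len(message) > self.posted.popleft():
            # No buffer, or buffer too small: the message is lost and
            # the QP moves to an error state.
            self.in_error = True
            return False
        return True
```
]]></sourcecode>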
<t>
Therefore, senders need to respect peer receiver resource limits
to ensure that the transport service
can deliver every message reliably.
Two operational parameters communicate these limits
between RPC-over-RDMA peers:
credits and inline threshold.
</t>

<section
 anchor="section_45c67eb8-8dc6-47c3-8555-14270f1514bF"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Flow Control</name>
<t>
RPC-over-RDMA version 2 employs
end-to-end credit-based flow control on each connection
to prevent a sender from transmitting more messages
than a receiver is prepared to accept
<xref target="CBFC" format="default" sectionFormat="of"/>.
Credit-based flow control is relatively simple, providing
automated management of receive buffer allocation
and
robust operation in the face of bursty traffic
while enabling effective pipelining.
The RPC-over-RDMA version 2 flow control mechanism relies on
reliable and in-order message delivery guarantees
provided by the underlying RDMA transport service.
</t>
<t>
An RPC-over-RDMA version 2 credit represents
the capability to convey
exactly one RPC-over-RDMA version 2 message,
regardless of its size,
via an RDMA Send/Receive pair.
Because an RPC-over-RDMA version 2 connection is full-duplex,
each connection peer has its own set of credits.
The two peers manage their credit limits independently,
although they communicate these values
by piggy-backing them on a message in the opposite direction.
</t>
<t>
Each peer tracks four critical values for each connection.
The peer uses these values to determine when it is safe to send
a message on the connection.
</t>
<dl newline="false" spacing="normal">
<dt>Sent message count:</dt>
<dd>
The total number of RDMA Send channel operations
the peer has posted on the connection.
The peer
<bcp14>MUST</bcp14>
set this value to zero (0) when a connection is first established.
</dd>
<dt>Received message count:</dt>
<dd>
The total number of RDMA Receive channel operations
that have completed on the connection.
The peer
<bcp14>MUST</bcp14>
set this value to zero (0) when a connection is first established.
</dd>
<dt>Advertised credits:</dt>
<dd>
The number of messages the peer is prepared to receive.
This value is typically equal to or less than
the number of RDMA Receive channel operations
that are currently pending on the connection.
</dd>
<dt>Received credits:</dt>
<dd>
The value in the rdma_start.rdma_credit field
in the most recently received message on the connection.
The peer
<bcp14>MUST</bcp14>
set this value to one (1) when a connection is first established.
</dd>
</dl>
<t>
When constructing an RPC-over-RDMA header to be sent,
a sender
<bcp14>MUST</bcp14>
set the header's rdma_start.rdma_credit field
to the sum of the connection's "Sent message count"
and "Advertised credits" values.
The sender
<bcp14>MUST NOT</bcp14>
post this message if the connection's "Sent message count"
is greater than the connection's current "Received credits" value.
To handle counter wrapping, the sender uses appropriate modulo
arithmetic to perform this comparison.
</t>
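<t>
As a non-normative illustration of the modulo comparison described above,
the following C sketch treats both values as 32-bit serial numbers.
The structure and function names are hypothetical
and are not defined by this protocol.
The sketch assumes that the two values never differ by 2^31 or more.
</t>
<sourcecode type="c"><![CDATA[
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state; names are illustrative. */
struct credit_state {
    uint32_t sent_count;       /* "Sent message count" */
    uint32_t received_credits; /* "Received credits" */
};

/*
 * Wrap-safe test for "a is greater than b": true when the
 * unsigned difference a - b lies in the range 1 .. 2^31 - 1.
 */
static bool counter_after(uint32_t a, uint32_t b)
{
    return a != b && (uint32_t)(a - b) < 0x80000000u;
}

/* True when the rule above permits posting another Send. */
static bool may_post_send(const struct credit_state *cs)
{
    return !counter_after(cs->sent_count, cs->received_credits);
}
]]></sourcecode>
<t>
Comparing the unsigned difference, rather than the raw values,
keeps the check correct after either counter wraps past 2^32 - 1.
</t>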
<t>
Because the rdma_start.rdma_credit field is 32 bits wide,
credit limit values
<bcp14>MUST</bcp14>
be less than 2^31 - 1.
For a given bandwidth-delay product,
a peer selects an "Advertised credits" value
that is large enough to maximize throughput
while not overwhelming its local memory resources.
</t>
<t>
A peer
<bcp14>MAY</bcp14>
adjust its "Advertised credits"
to match the needs or policies in effect on either peer.
For instance, a peer may reduce its "Advertised credits"
to accommodate the available resources in a Shared Receive Queue.
Certain RDMA implementations may impose additional flow-control restrictions,
such as limits on RDMA Read operations in progress at the Responder.
Accommodation of such checks is considered
the responsibility of each RPC-over-RDMA version 2 implementation.
</t>

<section
 anchor="section_f6348562-97f8-4413-96d7-bde3ff57b375"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Asynchronous Credit Grants</name>
<t>
Credit accounting information is usually piggy-backed
on payload-bearing messages.
However, on occasion, a peer might need to
reset its credit limit without sending an RPC payload.
A receiving peer can send a message using a special header type
when the sender's credit limit approaches exhaustion
during a stream of unacknowledged messages.
See
<xref target="section_d7b171ca-326d-45ed-bfa6-eca86ae4a62e" format="default" sectionFormat="of"/>
for information about this header type.
</t>
<t>
Unlike RPC-over-RDMA version 1,
the credit grant on an RPC-over-RDMA version 2
connection
<bcp14>MAY</bcp14>
be zero.
In that case, the sender waits
until the receiver sends it an asynchronous credit refresh.
To prevent a sender from ever having to wait for a credit refresh,
a good receiver implementation provides a credit refresh
before half its credit limit is exceeded.
</t>
<t>
To prevent transport deadlock,
receivers
<bcp14>MUST</bcp14>
always be in a position to receive
one asynchronous credit update message,
in addition to payload-bearing messages.
A receiver can do this by
posting one more RDMA Receive
than its "Advertised credits" value.
</t>
</section>

</section>

<section
 anchor="section_653e7ab4-782f-43f2-947f-d097ada8b3c9"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Inline Threshold</name>
<t>
An "inline threshold" value is the largest message size
(in octets) that can be conveyed in one direction between
peer implementations using RDMA Send and Receive channel operations.
An inline threshold value is
less than or equal to
the largest number of octets
the sender can post in a single RDMA Send operation.
It is also
less than or equal to
the largest number of octets
the receiver can reliably accept via a single RDMA Receive operation.
</t>
<t>
Each connection has two inline threshold values.
There is one for messages flowing from Requester-to-Responder
(referred to as the "call inline threshold"),
and one for messages flowing from Responder-to-Requester
(referred to as the "reply inline threshold").
</t>
<t>
Peers can advertise their inline threshold values
via RPC-over-RDMA version 2 Transport Properties (see
<xref target="section_86248e99-ca60-478a-8aff-3fb387410077" format="default" sectionFormat="of"/>).
In the absence of an exchange of Transport Properties,
connection peers
<bcp14>MUST</bcp14>
assume both inline thresholds are 4096 octets.
</t>
</section>

</section>

<section
 anchor="section_9563f1f0-28a7-48e1-a752-e842575c6539"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Initial Connection State</name>
<t>
Immediately upon connection establishment, both peers
<bcp14>MUST</bcp14>
allow only one outstanding RPC-over-RDMA message
on the connection at a time
until both the transport protocol version is established
and both peers have received an initial credit limit.
Note that because RPC-over-RDMA versions 1 and 2 each use
a different flow control mechanism,
the meaning of the value in the rdma_start.rdma_credit field
depends on the value in the rdma_start.rdma_vers field.
</t>
<t>
The second word of each transport header conveys
the transport protocol version.
Immediately after the client establishes a connection,
it sends a single valid RPC-over-RDMA message
with the value two (2) in the rdma_start.rdma_vers field.
Because the server might support
only RPC-over-RDMA version 1,
this initial message <bcp14>MUST NOT</bcp14> be larger than
the version 1 default inline threshold of 1024 octets.
</t>

<section
 anchor="section_8db4c54e-c1ce-43ba-93b4-031e829960f5"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Server Supports RPC-over-RDMA Version 2</name>
<t>
If the server supports RPC-over-RDMA version 2,
it sends RPC-over-RDMA messages back to the client
with the value two (2) in the rdma_start.rdma_vers field.
Both peers may assume the default inline threshold value
for RPC-over-RDMA version 2 connections (4096 octets).
</t>
</section>

<section
 anchor="section_bedc4e66-4295-4dd6-8ac9-dd06907a08ad"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Server Does Not Support RPC-over-RDMA Version 2</name>
<t>
If the server does not support RPC-over-RDMA version 2,
it <bcp14>MUST</bcp14> send an RPC-over-RDMA message to the client
with an XID that matches the client's first message,
RDMA2_ERROR in the rdma_start.rdma_htype field,
and with the error code RDMA2_ERR_VERS.
This message also reports the range
of RPC-over-RDMA protocol versions that the server supports.
To continue operation, the client selects
a protocol version in that range
for subsequent messages on this connection.
</t>
<t>
If the connection is dropped immediately after an
RDMA2_ERROR/RDMA2_ERR_VERS message is received,
the client should try to avoid a version negotiation loop
when re-establishing another connection.
It can assume that the server
does not support RPC-over-RDMA version 2.
A client can assume the same situation
(i.e., no server support for RPC-over-RDMA version 2)
if the initial negotiation message is lost or dropped.
Once the version negotiation exchange is complete,
both peers may use the default inline threshold value
for the negotiated transport protocol version.
</t>
</section>

<section
 anchor="section_1ffe4c69-b516-476a-bba7-41863709f48d"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Client Does Not Support RPC-over-RDMA Version 2</name>
<t>
The server examines the RPC-over-RDMA protocol version
used in the first RPC-over-RDMA message it receives.
If it supports this protocol version,
it <bcp14>MUST</bcp14> use it in all subsequent messages it sends on that connection.
The client <bcp14>MUST NOT</bcp14> change the protocol version
for the duration of the connection.
</t>
</section>

</section>

<section
 anchor="section_cfa8877c-b905-455d-b420-bf7a4a7f7829"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Using Direct Data Placement</name>
<t>
RPC-over-RDMA version 2 provides a mechanism
for moving part of an RPC message via a data transfer
distinct from RDMA Send and Receive.
For example,
a sender can remove one or more XDR data items from the Payload stream.
These items are then conveyed via other mechanisms,
such as one or more RDMA Read or Write operations.
</t>

<section
 anchor="section_8966c401-1714-413c-9384-b1f71f0a920d"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Chunks and Segments</name>
<t>
A Requester records the location information for each registered memory region
associated with an RPC payload
in the transport header of an RPC-over-RDMA message.
With this information, the Responder uses RDMA Read and Write operations to
retrieve arguments contained in the specified region of the Requester's memory
or
place results in that region.
</t>
<t>
A "segment" is a transport header data object
that contains the precise coordinates of a contiguous registered memory region.
Each segment contains the following information:
</t>
<dl newline="false" spacing="normal">
<dt>Handle:</dt>
<dd>
A Steering Tag (STag) or R_key generated by registering this memory with the RDMA provider.
</dd>
<dt>Length:</dt>
<dd>
The length of the segment's memory region, in octets.
The length of a segment
<bcp14>MAY</bcp14>
be aligned to a single octet.
An "empty segment" is defined as a segment
with the value zero (0) in its length field.
</dd>
<dt>Offset:</dt>
<dd>
The offset or beginning memory address of the segment's memory region.
</dd>
</dl>
<t>
The meaning of the values contained in these fields is elaborated in
<xref target="RFC5040" format="default"/>.
</t>
<t>
A "chunk" is simply a set of segments that have a related purpose.
A Requester
<bcp14>MAY</bcp14>
divide a chunk into segments using any convenient boundaries.
The length of a chunk is defined as
the sum of the lengths of the segments that comprise it.
</t>
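<t>
The definitions above can be sketched in C as follows.
This is a non-normative illustration.
The type and function names are hypothetical,
and the field widths (32-bit handle and length, 64-bit offset)
are an assumption made for this sketch
rather than something stated in this section.
</t>
<sourcecode type="c"><![CDATA[
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of a segment. */
struct segment {
    uint32_t handle;  /* STag or R_key */
    uint32_t length;  /* region length, in octets */
    uint64_t offset;  /* offset or starting address */
};

/* An "empty segment" has zero in its length field. */
static int segment_is_empty(const struct segment *seg)
{
    return seg->length == 0;
}

/* Chunk length: the sum of the lengths of its segments. */
static uint64_t chunk_length(const struct segment *segs, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += segs[i].length;
    return total;
}
]]></sourcecode>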
</section>

<section
 anchor="section_0e225040-d15f-4e9a-a0c3-afa115bb12c7"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Reducing a Payload Stream</name>
<t>
We refer to a data item that a sender removes from a Payload stream
to transmit separately as a "reduced" data item.
After a sender has finished removing XDR data items from a Payload stream,
we refer to it as a "reduced" Payload stream.
A set of segments that describe memory regions
containing a single reduced data item is categorized as
a "data item chunk."
</t>
<t>
Not all XDR data items benefit from Direct Data Placement.
For example, small data items
or
data items that require XDR unmarshaling by the receiver
do not benefit from DDP.
Moreover, it is impractical for receivers to prepare for
every possible XDR data item in a protocol to appear in a data item chunk.
</t>
<t>
Specifying which data items are DDP-eligible is done
in separate standards track documents known as "Upper Layer Bindings".
A ULB identifies which XDR data items a peer
<bcp14>MAY</bcp14>
transfer using DDP.
We refer to such data items as "DDP-eligible."
Senders
<bcp14>MUST NOT</bcp14>
reduce any other XDR data items.
Detailed requirements for ULB specifications appear in
<xref target="section_9e003b83-66b5-43d7-b9ef-0f271c8d301b" format="default" sectionFormat="of"/>
of the current document.
</t>
</section>

<section
 anchor="section_8c9060c3-e70e-4f37-ad0d-2fd074145d55"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Moving Whole RPC Messages using Explicit RDMA</name>
<t>
RPC-over-RDMA version 2 also enables the movement of
a whole RPC message via data transfer
distinct from RDMA Send and Receive.
A sender registers the memory containing a Payload stream
without regard to data item boundaries or DDP-eligibility.
The Payload stream is then conveyed via other mechanisms,
such as one or more RDMA Read or Write operations.
A set of segments that describe memory regions
containing a Payload stream is categorized as
a "body chunk".
</t>
<t>
A sender may first reduce that Payload stream if it contains
one or more DDP-eligible data items.
The sender moves these data items using data item chunks,
and the reduced Payload stream using a body chunk.
</t>
</section>

</section>

<section
 anchor="section_c561e45c-fe88-47e3-bdfd-689d482fcad3"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Encoding Chunks</name>
<t>
The RPC-over-RDMA version 2 transport protocol
does not place a limit on chunk size.
However, each ULP may cap the amount of data that can be transferred
by a single RPC transaction.
For example, NFS implementations typically have settings
that restrict the payload size of NFS READ and WRITE operations.
The Responder can use such limits to sanity check chunk sizes
before using them in RDMA operations.
</t>

<section
 anchor="section_025098f1-1355-4c97-8fda-5b4859372aa5"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Read Chunks</name>
<t>
A "Read chunk" contains data that its receiver pulls from the sender.
Each Read chunk is a set of one or more "Read segments"
encoded as a list.
A Read segment consists of a Position field
followed by a segment, as defined in
<xref target="section_8966c401-1714-413c-9384-b1f71f0a920d" format="default" sectionFormat="of"/>.
</t>
<dl newline="false" spacing="normal">
<dt>Position:</dt>
<dd>
The byte offset in the unreduced Payload stream
where the receiver reinserts the data item conveyed in the chunk.
The sender <bcp14>MUST</bcp14> compute the Position value
from the beginning of the unreduced Payload stream,
which begins at Position zero.
All segments in the same Read chunk share the same Position value,
even if one or more of the segments have a non-four-byte-aligned length.
The value in this field <bcp14>MUST</bcp14> be a multiple of four.
</dd>
</dl>
<t>
When constructing an RPC-over-RDMA message,
the sender registers memory regions containing
data intended for RDMA Read operations.
It advertises the coordinates of these regions in Read chunks added
to the transport header of an RPC-over-RDMA message.
</t>
<t>
The receiver of this message then pulls the chunk's data from the sender
using RDMA Read operations.
When receiving a Read chunk,
the receiver inserts the first Read segment in a Read chunk
into the Payload stream at the byte offset indicated by its Position field.
The receiver concatenates Read segments
whose Position field value matches this offset
until there are no more Read segments at that Position value.
</t>

<section
 anchor="section_52130e02-8afb-4a36-ac38-64a4cb73b5bf"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>The Read List</name>
<t>
Each RPC-over-RDMA message carries a list of Read segments
that make up the set of Read chunks for that message.
When no RDMA Read operations are needed
to complete the transmission of the message's Payload stream,
the message's Read list is empty.
</t>
<t>
If a Responder receives
a Read list whose segment position values
do not appear in monotonically increasing order, it
<bcp14>MUST</bcp14>
discard the message without processing it
and
respond with an RDMA2_ERROR message
with the rdma_xid field set to the XID of the malformed message and
the rdma_err field set to RDMA2_ERR_BAD_XDR.
</t>
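<t>
A receiver-side check of Read segment positions might be sketched in C
as shown below.
This is a non-normative illustration with a hypothetical function name.
It assumes that Position values are 32-bit unsigned integers,
and it interprets "monotonically increasing" as non-decreasing,
since all segments of one Read chunk share the same Position value.
It also applies the multiple-of-four requirement on Position values.
Error reporting is omitted.
</t>
<sourcecode type="c"><![CDATA[
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * pos[] holds the Position value of each Read segment, in the
 * order the segments appear in the Read list.
 */
static bool read_list_positions_ok(const uint32_t *pos, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pos[i] % 4 != 0)
            return false;   /* not a multiple of four */
        if (i > 0 && pos[i] < pos[i - 1])
            return false;   /* positions went backwards */
    }
    return true;            /* an empty list also passes */
}
]]></sourcecode>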
</section>

<section
 anchor="section_a7327632-a379-43a5-92f7-d5a83582cd6d"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>The Call Chunk</name>
<t>
The Call chunk is a Read chunk that acts as a body chunk
containing an RPC Call message.
A Requester can utilize a Call chunk at any time.
However, using a Call chunk is less efficient than
an RDMA Send.
</t>
<t>
A Read chunk may act as either a data item chunk or a body chunk.
When the chunk's position is zero, it acts as a body chunk.
Otherwise, it is a data item chunk containing exactly one XDR data item.
</t>
</section>

<section
 anchor="section_5f4cc5f7-325d-479c-b592-6e93979431d0"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Read Completion</name>
<t>
A Responder acknowledges that
it is finished with the Requester's Read chunk memory regions
when it sends the corresponding RPC Reply message.
The Requester may then invalidate memory regions belonging to
Read chunks associated with the matching RPC Call message.
</t>
</section>

</section>

<section
 anchor="section_b8492157-7734-43e8-9a90-8ee32c674a12"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Write Chunks</name>
<t>
Each "Write chunk" consists of a counted array of
zero or more segments, as defined in
<xref target="section_8966c401-1714-413c-9384-b1f71f0a920d" format="default"/>.
The function of a Write chunk depends on the direction of
the containing RPC-over-RDMA message.
In a Call message,
a Write chunk advertises registered memory regions into which the Responder
may push data.
In a Reply message,
a Write chunk reports how much data has been pushed.
</t>
<t>
A Requester provisions Write chunks for an RPC transaction
long before the Responder has constructed a corresponding Reply message.
A Requester typically does not know the actual length
of the result data items or Reply to be returned,
since the Reply does not yet exist.
Thus, a Requester
<bcp14>MUST</bcp14>
provision Write chunks large enough
to accommodate the maximum possible size of each returned data item.
</t>
<t>
An "empty Write chunk" is a Write chunk with a zero segment count.
By definition, the length of an empty Write chunk is zero.
An "unused Write chunk" has a non-zero segment count,
but all of its segments are empty segments.
</t>
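<t>
The distinction between an empty and an unused Write chunk
can be sketched in C as follows.
This is a non-normative illustration.
The names are hypothetical,
and the label WC_IN_USE is introduced only for this sketch
to cover a chunk that is neither empty nor unused.
</t>
<sourcecode type="c"><![CDATA[
#include <stddef.h>
#include <stdint.h>

enum write_chunk_kind {
    WC_EMPTY,   /* segment count is zero */
    WC_UNUSED,  /* non-zero count, every segment is empty */
    WC_IN_USE   /* at least one segment has non-zero length */
};

/* seg_len[] holds the length field of each segment. */
static enum write_chunk_kind
classify_write_chunk(const uint32_t *seg_len, size_t nsegs)
{
    if (nsegs == 0)
        return WC_EMPTY;
    for (size_t i = 0; i < nsegs; i++)
        if (seg_len[i] != 0)
            return WC_IN_USE;
    return WC_UNUSED;
}
]]></sourcecode>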

<section
 anchor="section_480387ae-1a7c-46b8-b310-8badb7b2a62b"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>The Write List</name>
<t>
Each RPC-over-RDMA message carries a list of Write chunks.
When no DDP-eligible data items are to appear
in the Reply to an RPC transaction,
the Requester provides an empty Write list in the RPC Call,
and the Responder leaves the Write list empty in the matching RPC Reply.
When a Write chunk appears in the Write list,
it acts only as a data item chunk.
</t>
<t>
For each Write chunk in the Write list,
the Responder pushes one DDP-eligible data item to the Requester.
It fills the chunk contiguously and in segment array order
until the Responder has written that data item to the Requester in its entirety.
The Responder <bcp14>MUST</bcp14> copy the segment count
and all segments from the Requester-provided Write chunk
into the RPC Reply message's transport header.
As it does so, the Responder updates each segment length field
to reflect the actual amount of data returned in that segment.
</t>
<t>
The Responder then sends the RPC Reply message via an RDMA Send operation.
</t>
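<t>
The segment length update described in this section
might be sketched in C as shown below.
This is a non-normative illustration.
The function name is hypothetical, and the sketch models only
the bookkeeping of segment lengths, not the RDMA Write operations.
How a Responder handles a chunk that is too small for the data item
is outside the scope of this sketch.
</t>
<sourcecode type="c"><![CDATA[
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * On entry, seg_len[] holds the segment lengths provided by
 * the Requester.  On return, each entry holds the number of
 * octets of the data item placed in that segment, filling
 * segments in array order.  Returns false when the chunk is
 * too small to hold item_len octets.
 */
static bool
fill_write_chunk(uint32_t *seg_len, size_t nsegs, uint64_t item_len)
{
    uint64_t remaining = item_len;

    for (size_t i = 0; i < nsegs; i++) {
        uint32_t take = seg_len[i];

        if (remaining < take)
            take = (uint32_t)remaining;
        seg_len[i] = take;
        remaining -= take;
    }
    return remaining == 0;
}
]]></sourcecode>
<t>
In this sketch, segments that follow the end of the data item
are reported with a length of zero.
</t>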
</section>

<section
 anchor="section_af268e5f-c34b-4329-aa2a-578e4b47b914"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>The Reply Chunk</name>
<t>
The Reply chunk is a single Write chunk that acts as a body chunk
containing an RPC Reply message.
When a Requester estimates that the Reply message can exceed the
connection's ability to convey that Reply using RDMA Send operations,
it should provision a Reply chunk.
</t>
</section>

<section
 anchor="section_76f7431e-a3d6-4e62-8c8d-1844c564f944"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Write Completion</name>
<t>
A Responder acknowledges that
it is finished updating the Requester's Write chunk memory regions
when it sends the corresponding RPC Reply message.
The RDMA provider guarantees that the written data is at rest
before the next Receive operation, which typically contains the
corresponding RPC Reply, completes.
The Requester may then invalidate memory regions belonging to
Write chunks associated with the matching RPC Call message.
</t>
</section>

<section
 anchor="section_00f1edd4-2e2f-4b54-bc21-9d6dfa39556b"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Write Chunk Roundup</name>
<t>
When provisioning a Write chunk for a variable-length result data item,
the Requester <bcp14>MUST NOT</bcp14> include additional space for XDR roundup padding.
A Responder <bcp14>MUST NOT</bcp14> write XDR roundup padding into a Write chunk,
even if the result is shorter than the available space in the chunk.
</t>
</section>

</section>

<section
 anchor="section_4acf0f2d-6719-43cf-8231-e9ce8fa9791e"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Reducing Complex XDR Data Types</name>
<t>
XDR data items may appear in body chunks
without regard to their DDP-eligibility.
As body chunks contain a Payload stream, they
<bcp14>MUST</bcp14>
include all appropriate XDR roundup padding
to maintain proper XDR alignment of their contents.
</t>
<t>
However, a data item chunk
<bcp14>MUST</bcp14>
contain only one XDR data item, and the chunk
<bcp14>MUST</bcp14>
occupy a four-byte aligned length in the Payload stream
so that subsequent data items remain properly aligned
once the reduced data item is removed from the Payload stream.
</t>

<section
 anchor="section_a7aa6a3a-2d02-4c11-bb88-0ba2eb76b0c0"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Variable-Length Data Items</name>
<t>
When a sender reduces a variable-length XDR data item,
the length of the item
<bcp14>MUST</bcp14>
remain in the Payload stream.
The sender
<bcp14>MUST</bcp14>
omit the item's XDR roundup padding
from the Payload stream and the chunk.
The chunk's total length
<bcp14>MUST</bcp14>
be the same as the encoded length of the data item.
</t>
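<t>
As a non-normative illustration, the following C sketch shows
the sizes involved when a sender reduces
a variable-length opaque data item of a given length.
The names are hypothetical.
The sketch assumes the usual XDR encoding,
in which a four-octet length word precedes the item's data
and the data is padded to a multiple of four octets.
</t>
<sourcecode type="c"><![CDATA[
#include <stdint.h>

/* Octets of XDR roundup padding for an item of len octets. */
static uint32_t xdr_pad_octets(uint32_t len)
{
    return (4 - (len & 3)) & 3;
}

struct reduced_sizes {
    uint32_t stream_octets; /* left in the Payload stream */
    uint32_t chunk_octets;  /* carried in the data item chunk */
};

/*
 * Sizes after reduction: the length word stays in the
 * Payload stream; the chunk carries the item's data with
 * no roundup padding.
 */
static struct reduced_sizes reduce_opaque(uint32_t len)
{
    struct reduced_sizes r;

    r.stream_octets = 4;
    r.chunk_octets = len;
    return r;
}
]]></sourcecode>
<t>
For an item of five octets, for example,
the unreduced encoding occupies 4 + 5 + xdr_pad_octets(5) = 12 octets,
while the reduced Payload stream retains only the four-octet length word
and the chunk's total length is five octets.
</t>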
</section>

<section
 anchor="section_2b8d1299-b521-4198-b99c-25b4e344b8bb"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Counted Arrays</name>
<t>
When reducing a data item that is a counted array data type,
the count of array elements
<bcp14>MUST</bcp14>
remain in the Payload stream.
The sender
<bcp14>MUST</bcp14>
move the array elements into the chunk.
For example, when encoding an opaque byte array as a chunk,
the count of bytes stays in the Payload stream,
and the sender places the bytes in the array in the chunk.
</t>
<t>
Individual array elements appear in a chunk in their entirety.
For example, when encoding an array of arrays as a chunk,
the count of items in the enclosing array stays in the Payload stream.
But each enclosed array, including its item count,
is transferred as part of the chunk.
</t>
</section>

<section
 anchor="section_ec322087-6d38-42b9-b24b-831aabcfb5f9"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Optional-Data</name>
<t>
Similar to a counted array,
when reducing an optional-data data type,
the discriminator field
<bcp14>MUST</bcp14>
remain in the Payload stream.
The sender
<bcp14>MUST</bcp14>
place the data, when present, in the chunk.
</t>
</section>

<section
 anchor="section_5ae887e6-fd9d-4a9f-a298-925b87a6a5b2"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>XDR Unions</name>
<t>
A union data type
<bcp14>MUST NOT</bcp14>
be made DDP-eligible.
However, one or more of its arms
<bcp14>MAY</bcp14>
be made DDP-eligible, subject to the other requirements in this section.
</t>
</section>

</section>

</section>

<section
 anchor="section_41b8bd31-255d-4fab-a81b-df765f11ad47"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Reverse-Direction Operation</name>
<t>
The terminology used in this section is introduced in
<xref target="section_189a59d0-9235-4c1e-a76f-ea2b20fd6c94" format="default" sectionFormat="of"/>.
</t>

<section
 anchor="section_3816dd27-8007-47b0-82da-9d69f578f29d"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Sending a Reverse-Direction RPC Call</name>
<t>
An RPC-over-RDMA server endpoint constructs the transport header
for a reverse-direction RPC Call as follows:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The server generates a new XID value (see
<xref target="section_97382254-b1a3-4e03-98e5-a0814b331bd0" format="default" sectionFormat="of"/>
for full requirements) and places it
in the rdma_xid field of the transport header
and the xid field of the RPC Call message.
The RPC Call header <bcp14>MUST</bcp14> start with the same XID value
that is present in the transport header.
</li>
<li>
The rdma_vers field of each reverse-direction Call
<bcp14>MUST</bcp14> contain the same value as forward-direction Calls
on the same connection.
</li>
<li>
The server fills in the rdma_credit field
with the credit values for the connection, as described in
<xref target="section_45c67eb8-8dc6-47c3-8555-14270f1514bF" format="default" sectionFormat="of"/>.
</li>
<li>
The server determines the Payload format for the RPC message
and fills in the rdma_htype field as appropriate
(see Sections
<xref target="section_740e5b29-8c88-40ab-9506-69635d9a8167" format="counter" sectionFormat="of"/>
and
<xref target="section_858fef67-6100-48bc-8d3d-d46201a4b461" format="counter" sectionFormat="of"/>).
<xref target="section_858fef67-6100-48bc-8d3d-d46201a4b461" format="default" sectionFormat="of"/>
also covers the disposition of the chunk lists.
</li>
</ul>
</section>

<section
 anchor="section_748614a9-26da-4846-b5ff-43b06823abab"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Sending a Reverse-Direction RPC Reply</name>
<t>
An RPC-over-RDMA client endpoint constructs the transport header
for a reverse-direction RPC Reply as follows:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The client copies the XID value from the matching RPC Call
and
places it in the rdma_xid field of the transport header
and
the xid field of the RPC Reply message.
The RPC Reply header <bcp14>MUST</bcp14> start with the same XID value
that is present in the transport header.
</li>
<li>
The rdma_vers field of each reverse-direction Reply
<bcp14>MUST</bcp14> contain the same value as forward-direction Replies
on the same connection.
</li>
<li>
The client fills in the rdma_credit field
with the credit values for the connection, as described in
<xref target="section_45c67eb8-8dc6-47c3-8555-14270f1514bF" format="default" sectionFormat="of"/>.
</li>
<li>
The client determines the Payload format for the RPC message
and fills in the rdma_htype field as appropriate
(see Sections
<xref target="section_740e5b29-8c88-40ab-9506-69635d9a8167" format="counter" sectionFormat="of"/>
and
<xref target="section_858fef67-6100-48bc-8d3d-d46201a4b461" format="counter" sectionFormat="of"/>).
<xref target="section_858fef67-6100-48bc-8d3d-d46201a4b461" format="default" sectionFormat="of"/>
also covers the disposition of the chunk lists.
</li>
</ul>
</section>

<section
 anchor="section_ed1c8ecf-360c-496f-ba93-117f016dfd4c"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>When Reverse-Direction Operation is Not Supported</name>
<t>
An RPC-over-RDMA transport endpoint does not
have to support reverse-direction operation.
There might be no mechanism in the transport implementation to do so.
Or, the transport implementation might support
operation in the reverse direction,
but the Upper-Layer Protocol might not
configure the transport to handle reverse-direction traffic.
</t>
<t>
If an endpoint is unprepared to receive a reverse-direction message,
loss of the RDMA connection might result.
Thus a denial of service can occur if an RPC server
continues to send reverse-direction messages
after a client that is not prepared to receive them
reconnects to that server.
</t>
<t>
Connection peers indicate their support for reverse-direction operation
as part of the exchange of Transport Properties
just after a connection is established (see
<xref target="section_6ace2d7f-044b-491f-97ea-5760345a2e8f" format="default" sectionFormat="of"/>).
</t>
<t>
When dealing with the possibility that the remote peer
has no transport level support for reverse-direction operation,
the Upper-Layer Protocol is responsible for informing peers
when reverse-direction operation is supported.
Otherwise, even a simple reverse-direction RPC NULL procedure
from a peer could result in a lost connection.
Therefore, an Upper-Layer Protocol <bcp14>MUST NOT</bcp14> perform
reverse-direction RPC operations
until the RPC client indicates support for them.
</t>
</section>

<section
 anchor="section_858fef67-6100-48bc-8d3d-d46201a4b461"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Using Chunks During Reverse-Direction Operation</name>
<t>
Reverse-direction operations can use chunks for
DDP-eligible data items
and
Special payload formats
the same way chunks are used in forward-direction operation.
Connection peers indicate their support for
using chunks in the reverse direction
as part of the exchange of Transport Properties
just after a connection is established (see
<xref target="section_6ace2d7f-044b-491f-97ea-5760345a2e8f" format="default" sectionFormat="of"/>).
</t>
<t>
However, an implementation might support
only Upper-Layer Protocols
that have no DDP-eligible data items.
Such Upper-Layer Protocols can use only small messages,
or they might have a native mechanism
for restricting the size of reverse-direction RPC messages,
obviating the need to handle chunks in the reverse direction.
</t>
<t>
When there is no Upper-Layer Protocol need
for chunks in the reverse direction,
implementers <bcp14>MAY</bcp14> choose not to provide support
for chunks in the reverse direction,
thus avoiding the complexity of implementing support
for RDMA Reads and Writes in the reverse direction.
When an RPC-over-RDMA transport implementation does not support
chunks in the reverse direction, RPC endpoints use only
the Simple Payload format without data item chunks
or
the Continued Payload format without data item chunks
to send RPC messages in the reverse direction.
</t>
<t>
If a reverse-direction Requester provides a non-empty chunk list
to a Responder that does not support chunks,
the Responder <bcp14>MUST</bcp14> report its lack of support
using one of the error values defined in
<xref target="section_87c5f543-7092-4faf-b13e-0e994e8023a7" format="default" sectionFormat="of"/>.
</t>
</section>

<section
 anchor="section_328c8cb7-80db-4b24-b871-2983e9648b45"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Reverse-Direction Retransmission</name>
<t>
In rare cases, an RPC server cannot
complete an RPC transaction
or
cannot send a Reply.
In these cases, the Requester may send the RPC transaction again
using the same RPC XID.
</t>
<t>
In the forward direction,
an RPC client is the Requester.
The client is always responsible for ensuring
a transport connection is in place
before sending a dropped Call again.
</t>
<t>
With reverse-direction operation,
an RPC server is the Requester.
Because an RPC server is not responsible
for establishing transport connections with clients,
the Requester is unable to retransmit a reverse-direction Call
whenever there is no transport connection.
In this case,
the RPC server must wait for the RPC client
to re-establish a transport connection
before it can retransmit reverse-direction RPC Calls.
</t>
<t>
If the forward-direction Requester has no work to do,
it can be some time before the RPC client
re-establishes a transport connection.
An RPC server may need to abandon a pending reverse-direction RPC Call
to avoid waiting indefinitely for the client
to re-establish a transport connection.
</t>
<t>
Therefore, forward-direction Requesters <bcp14>SHOULD</bcp14> maintain a transport connection
as long as the RPC server might send reverse-direction Calls.
For example, while an NFS version 4.1 client has
open delegated files
or
active pNFS layouts,
it maintains one or more transport connections
to enable the NFS server to perform callback operations.
</t>
</section>

</section>

<section
 anchor="section_65de80fe-06ae-49ac-a4d8-e88dbbf787ab"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Call-Only Operation</name>
<t>
There is no corresponding Reply to a Call-only procedure.
Thus there is no opportunity for the Responder to
indicate it has completed its use of Read or Call chunks
that hold arguments or the whole Call.
In addition,
Write and Reply chunks are not necessary
because there are no results and no Reply message.
Therefore, Requesters
<bcp14>MUST NOT</bcp14>
use chunks when sending Call-only RPC procedures.
</t>
</section>

</section>

<section
 anchor="section_86248e99-ca60-478a-8aff-3fb387410077"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Transport Properties</name>
<t>
RPC-over-RDMA version 2 enables connection endpoints
to exchange information about implementation properties.
Compatible endpoints use this information to optimize data transfer.
Initially, only a small set of transport properties is defined.
The protocol provides header types
to exchange transport properties (see
<xref target="section_977e471d-6948-43da-b3c8-bbb36034bd3e" format="counter" sectionFormat="of"/>
and
<xref target="section_e8fbc443-1f44-4ec6-9132-9a28b4e6c870" format="counter" sectionFormat="of"/>).
</t>
<t>
Both the set of transport properties and the operations used
to communicate them may be extended.
Within RPC-over-RDMA version 2, such extensions are <bcp14>OPTIONAL</bcp14>.
A discussion of extending the set of transport properties appears in
<xref target="section_a355adad-f03b-41a6-94a8-4128b10301bb" format="default" sectionFormat="of"/>.
</t>

<section
 anchor="section_d5ac12f6-6735-48f3-b4ba-b44a19ff9298"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Transport Properties Model</name>
<t>
The current document specifies a basic set of receiver and sender properties.
Such properties are specified using
a code point that identifies the particular transport property
and
a nominally opaque array containing the XDR encoding of the property.
</t>
<t>
The following XDR types handle transport properties:
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
typedef uint32 rpcrdma2_propid;

struct rpcrdma2_propval {
        rpcrdma2_propid rdma_which;
        opaque          rdma_data<>;
};

typedef rpcrdma2_propval rpcrdma2_propset<>;

typedef uint32 rpcrdma2_propsubset<>;
]]>
</sourcecode>
<t>
The rpcrdma2_propid type specifies a distinct transport property.
The property code points are defined as const values rather than
elements in an enum type
so that the set can be extended by concatenating XDR definition files.
</t>
<t>
The rpcrdma2_propval type carries the value of a transport property.
The rdma_which field identifies the particular property,
and
the rdma_data field contains the associated value of that property.
A zero-length rdma_data field represents
the default value of the property specified by rdma_which.
</t>
<t>
Although the rdma_data field is opaque,
receivers interpret its contents using the XDR type
associated with the property specified by rdma_which.
When the contents of the rdma_data field
do not conform to that XDR type,
the receiver <bcp14>MUST</bcp14> return the error RDMA2_ERR_BAD_PROPVAL
using the header type RDMA2_ERROR,
as described in
<xref target="section_b1d23e5c-31df-483f-adb7-25430b5de38d" format="default" sectionFormat="of"/>.
</t>
<t>
For example, the receiver of a message
containing an otherwise valid rpcrdma2_propval returns this error
if the length of rdma_data is greater than
the length of the transferred message.
Also, when the receiver recognizes
the rpcrdma2_propid contained in rdma_which,
it <bcp14>MUST</bcp14> report the error RDMA2_ERR_BAD_PROPVAL
if either of the following occurs:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The nominally opaque data within rdma_data is not valid when
interpreted using the property-associated typedef.
</li>
<li>
The length of rdma_data is insufficient to contain the data
represented by the property-associated typedef.
</li>
</ul>
<t>
A receiver does not report an error if it does not recognize
the value contained in rdma_which.
In that case, the receiver does not process that rpcrdma2_propval.
Processing continues with the next rpcrdma2_propval, if any.
</t>
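<t>
The receiver rules above can be illustrated with a short,
non-normative C sketch for a property whose associated XDR type is uint32.
The structure propval, the status codes, and the function prop_get_uint32
are hypothetical names introduced only for this illustration.
The sketch also assumes that any non-zero length other than four octets
is treated as nonconforming.
</t>
<sourcecode name="" type="c" markers="false">
<![CDATA[
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one decoded rpcrdma2_propval. */
struct propval {
        uint32_t       which;     /* rdma_which code point */
        uint32_t       data_len;  /* length of rdma_data, in octets */
        const uint8_t *data;      /* rdma_data contents (XDR encoded) */
};

enum prop_status { PROP_OK, PROP_SKIPPED, PROP_BAD_PROPVAL };

/*
 * Extract the value of one uint32-typed property.
 * known: nonzero if the receiver recognizes pv->which
 * dflt:  default value of the property
 */
static enum prop_status
prop_get_uint32(const struct propval *pv, int known,
                uint32_t dflt, uint32_t *out)
{
        if (!known)
                return PROP_SKIPPED;     /* unrecognized: not an error */
        if (pv->data_len == 0) {
                *out = dflt;             /* zero length selects default */
                return PROP_OK;
        }
        if (pv->data_len != 4)           /* assumption: strict length */
                return PROP_BAD_PROPVAL; /* report RDMA2_ERR_BAD_PROPVAL */

        /* XDR encodes uint32 as four octets, most significant first. */
        *out = ((uint32_t)pv->data[0] << 24) |
               ((uint32_t)pv->data[1] << 16) |
               ((uint32_t)pv->data[2] << 8)  |
                (uint32_t)pv->data[3];
        return PROP_OK;
}
]]>
</sourcecode>
<t>
A caller would invoke such a helper once per rpcrdma2_propval,
continuing with the next element when the result is PROP_SKIPPED.
</t>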
<t>
The rpcrdma2_propset type specifies a set of transport properties.
The protocol does not impose a particular ordering of the rpcrdma2_propval items within it.
</t>
<t>
The rpcrdma2_propsubset type identifies
a subset of the properties in a rpcrdma2_propset.
Each bit in the mask denotes a particular element in a previously
specified rpcrdma2_propset.
If a particular rpcrdma2_propval is at position N in the array,
then bit number N mod 32 in word N div 32 specifies whether
the defined subset includes that particular rpcrdma2_propval.
Words beyond the last one specified are assumed to contain zero.
</t>
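<t>
As a non-normative illustration of the word and bit arithmetic described above,
the following C sketch tests whether the element at position N is part of a subset.
The function name propsubset_test is hypothetical.
The sketch assumes that bit number 0 is the least significant bit of each word,
which the text above does not state explicitly.
</t>
<sourcecode name="" type="c" markers="false">
<![CDATA[
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if the rpcrdma2_propval at position n is included in the
 * subset, 0 otherwise. words/nwords hold the decoded
 * rpcrdma2_propsubset array in host byte order.
 */
static int
propsubset_test(const uint32_t *words, size_t nwords, uint32_t n)
{
        size_t   word = n / 32;     /* "N div 32" */
        uint32_t bit  = n % 32;     /* "N mod 32" */

        if (word >= nwords)
                return 0;           /* absent words count as zero */
        return (int)((words[word] >> bit) & 1u);
}
]]>
</sourcecode>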
</section>

<section
 anchor="section_943010bd-c342-46b7-9fcd-df746437dd6f"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Current Transport Properties</name>
<t>
<xref target="table_99d0e7cc-da81-4f16-9bd0-471f806bc0b6" format="default" sectionFormat="of"/>
specifies a basic set of transport properties.
The columns contain the following information:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The column labeled "Property" contains a name
of the transport property described by the current row.
</li>
<li>
The column labeled "Code" specifies the code point
that identifies this property.
</li>
<li>
The column labeled "XDR type" gives the XDR type
of the data used to communicate the value of this property.
This data type overlays the data portion
of the nominally opaque rdma_data field.
</li>
<li>
The column labeled "Default" gives the default value for the property.
</li>
<li>
The column labeled "Section" indicates the section within the current
document that explains the use of this property.
</li>
</ul>
<table align="left"
 anchor="table_99d0e7cc-da81-4f16-9bd0-471f806bc0b6">
<thead>
<tr>
<th align="left">Property</th>
<th align="left">Code</th>
<th align="left">XDR type</th>
<th align="left">Default</th>
<th align="left">Section</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Maximum Send Size</td>
<td align="left">1</td>
<td align="left">uint32</td>
<td align="left">4096</td>
<td align="left">
<xref target="section_0a985ff5-a5c1-477f-8932-517be34ccf65" format="counter" sectionFormat="of"/>
</td>
</tr>
<tr>
<td align="left">Receive Buffer Size</td>
<td align="left">2</td>
<td align="left">uint32</td>
<td align="left">4096</td>
<td align="left">
<xref target="section_5101b1f1-b1ad-4b6b-9fa4-d6fa324ffc0d" format="counter" sectionFormat="of"/>
</td>
</tr>
<tr>
<td align="left">Maximum Segment Size</td>
<td align="left">3</td>
<td align="left">uint32</td>
<td align="left">1048576</td>
<td align="left">
<xref target="section_14ed280d-521c-410c-a190-cf891be53900" format="counter" sectionFormat="of"/>
</td>
</tr>
<tr>
<td align="left">Maximum Segment Count</td>
<td align="left">4</td>
<td align="left">uint32</td>
<td align="left">16</td>
<td align="left">
<xref target="section_98fc720d-6263-4a52-ae89-d2469b982512" format="counter" sectionFormat="of"/>
</td>
</tr>
<tr>
<td align="left">Reverse-Direction Support</td>
<td align="left">5</td>
<td align="left">uint32</td>
<td align="left">0</td>
<td align="left">
<xref target="section_6ace2d7f-044b-491f-97ea-5760345a2e8f" format="counter" sectionFormat="of"/>
</td>
</tr>
<tr>
<td align="left">Host Auth Message</td>
<td align="left">6</td>
<td align="left">opaque&lt;&gt;</td>
<td align="left">N/A</td>
<td align="left">
<xref target="section_5f63e1b6-8d24-453b-b18b-b98ad66f3671" format="counter" sectionFormat="of"/>
</td>
</tr>
</tbody>
</table>
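<t>
An implementation might keep the uint32-valued properties of its peer
in a small structure that starts out holding the defaults from the table.
The C sketch below is non-normative;
the structure peer_props and both function names are hypothetical,
and the Host Auth Message property is omitted because it has no default
and is not a uint32.
</t>
<sourcecode name="" type="c" markers="false">
<![CDATA[
#include <stdint.h>

/* Hypothetical record of a peer's uint32-valued properties. */
struct peer_props {
        uint32_t max_send_size;     /* code 1 */
        uint32_t recv_buf_size;     /* code 2 */
        uint32_t max_seg_size;      /* code 3 */
        uint32_t max_seg_count;     /* code 4 */
        uint32_t rvrsdir_support;   /* code 5 */
};

/* Start from the defaults listed in the table. */
static void
peer_props_init(struct peer_props *p)
{
        p->max_send_size   = 4096;
        p->recv_buf_size   = 4096;
        p->max_seg_size    = 1048576;
        p->max_seg_count   = 16;
        p->rvrsdir_support = 0;
}

/*
 * Record one received value. Returns 1 if the code point is one of
 * the uint32 properties in the table, 0 if it was not stored.
 */
static int
peer_props_set(struct peer_props *p, uint32_t code, uint32_t value)
{
        switch (code) {
        case 1: p->max_send_size   = value; return 1;
        case 2: p->recv_buf_size   = value; return 1;
        case 3: p->max_seg_size    = value; return 1;
        case 4: p->max_seg_count   = value; return 1;
        case 5: p->rvrsdir_support = value; return 1;
        default: return 0;
        }
}
]]>
</sourcecode>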

<section
 anchor="section_0a985ff5-a5c1-477f-8932-517be34ccf65"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Maximum Send Size</name>
<t>
The value of this property specifies
the maximum size, in octets, of Send payloads.
The endpoint receiving this value
can size its Receive buffers
based on the value of this property.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const uint32 RDMA2_PROPID_SBSIZ = 1;
typedef uint32 rpcrdma2_prop_sbsiz;
]]>
</sourcecode>
</section>

<section
 anchor="section_5101b1f1-b1ad-4b6b-9fa4-d6fa324ffc0d"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Receive Buffer Size</name>
<t>
The value of this property specifies
the minimum size, in octets, of pre-posted receive buffers.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const uint32 RDMA2_PROPID_RBSIZ = 2;
typedef uint32 rpcrdma2_prop_rbsiz;
]]>
</sourcecode>
<t>
A sender can subsequently use this value to
determine when a message to be sent fits in
pre-posted receive buffers that the receiver has set up.
In particular:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
Requesters may use the value to determine
when to use a Call chunk or Message Continuation
when sending a Call.
</li>
<li>
Requesters may use the value to determine when
to provide a Reply chunk when sending a Call,
based on the maximum possible size of the Reply.
</li>
<li>
Responders may use the value to determine when
to use a Reply chunk provided by the Requester,
given the actual size of a Reply.
</li>
</ul>
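<t>
The first of these decisions reduces to a size comparison.
The following non-normative C sketch shows one way a Requester might make it.
The enum call_format, the function call_choose_format,
and the parameter names are hypothetical;
the sketch assumes the caller already knows the length of the transport header
that precedes the payload.
</t>
<sourcecode name="" type="c" markers="false">
<![CDATA[
#include <stdint.h>

enum call_format {
        CALL_FITS_INLINE,
        CALL_NEEDS_CHUNK_OR_CONTINUATION
};

/*
 * hdr_len:     octets of transport header preceding the payload
 * payload_len: octets of RPC Call message payload
 * peer_rbsiz:  Receive Buffer Size property value from the peer
 */
static enum call_format
call_choose_format(uint32_t hdr_len, uint32_t payload_len,
                   uint32_t peer_rbsiz)
{
        /* A 64-bit sum avoids wraparound for very large payloads. */
        uint64_t total = (uint64_t)hdr_len + payload_len;

        if (total <= peer_rbsiz)
                return CALL_FITS_INLINE;
        return CALL_NEEDS_CHUNK_OR_CONTINUATION;
}
]]>
</sourcecode>
<t>
Whether to prefer a Call chunk or Message Continuation
once the message no longer fits is left to the implementation
and is not modeled here.
</t>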
</section>

<section
 anchor="section_14ed280d-521c-410c-a190-cf891be53900"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Maximum Segment Size</name>
<t>
The value of this property specifies
the maximum size, in octets,
of a segment this endpoint is prepared to send or receive.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const uint32 RDMA2_PROPID_RSSIZ = 3;
typedef uint32 rpcrdma2_prop_rssiz;
]]>
</sourcecode>
</section>

<section
 anchor="section_98fc720d-6263-4a52-ae89-d2469b982512"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Maximum Segment Count</name>
<t>
The value of this property specifies
the maximum number of segments
that can appear in a Requester's transport header.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const uint32 RDMA2_PROPID_RCSIZ = 4;
typedef uint32 rpcrdma2_prop_rcsiz;
]]>
</sourcecode>
</section>

<section
 anchor="section_6ace2d7f-044b-491f-97ea-5760345a2e8f"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Reverse-Direction Support</name>
<t>
The value of this property specifies
a client implementation's readiness to process
messages that are part of reverse-direction RPC requests.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const uint32 RDMA2_RVRSDIR_NONE = 0;
const uint32 RDMA2_RVRSDIR_SIMPLE = 1;
const uint32 RDMA2_RVRSDIR_CONT = 2;
const uint32 RDMA2_RVRSDIR_GENL = 3;

const uint32 RDMA2_PROPID_BRS = 5;
typedef uint32 rpcrdma2_prop_brs;
]]>
</sourcecode>
<t>
Multiple levels of support are distinguished:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The value RDMA2_RVRSDIR_NONE indicates
that the sender does not support reverse-direction operation.
</li>
<li>
The value RDMA2_RVRSDIR_SIMPLE indicates
that the sender supports using only Simple Format messages without data item chunks
for reverse-direction messages.
</li>
<li>
The value RDMA2_RVRSDIR_CONT indicates
that the sender supports using either
Simple Format without data item chunks
or
Continued Format messages without data item chunks
for reverse-direction messages.
</li>
<li>
The value RDMA2_RVRSDIR_GENL indicates
that the sender supports reverse-direction messages
in the same way as forward-direction messages.
</li>
</ul>
<t>
When a peer does not provide this property,
the default is that the peer does not support
reverse-direction operation.
</t>
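<t>
The support levels can be read as a predicate over
the format of a reverse-direction message.
The C sketch below is non-normative.
The enum msg_format and the function rvrsdir_permits are hypothetical,
and the sketch assumes that a value outside the four listed above
is handled like RDMA2_RVRSDIR_NONE.
</t>
<sourcecode name="" type="c" markers="false">
<![CDATA[
#include <stdint.h>

#define RDMA2_RVRSDIR_NONE   0u
#define RDMA2_RVRSDIR_SIMPLE 1u
#define RDMA2_RVRSDIR_CONT   2u
#define RDMA2_RVRSDIR_GENL   3u

/* Hypothetical labels for how a message is conveyed. */
enum msg_format { FMT_SIMPLE, FMT_CONTINUED, FMT_OTHER };

/*
 * Return 1 if a peer advertising the given support level accepts a
 * reverse-direction message of this format, 0 otherwise.
 * has_chunks is nonzero when the message carries data item chunks.
 */
static int
rvrsdir_permits(uint32_t level, enum msg_format fmt, int has_chunks)
{
        switch (level) {
        case RDMA2_RVRSDIR_SIMPLE:
                return fmt == FMT_SIMPLE && !has_chunks;
        case RDMA2_RVRSDIR_CONT:
                return (fmt == FMT_SIMPLE || fmt == FMT_CONTINUED) &&
                       !has_chunks;
        case RDMA2_RVRSDIR_GENL:
                return 1;   /* same as forward direction */
        default:
                return 0;   /* NONE, or an unknown value (assumption) */
        }
}
]]>
</sourcecode>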
</section>

<section
 anchor="section_5f63e1b6-8d24-453b-b18b-b98ad66f3671"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Host Authentication Message</name>
<t>
The value of this transport property enables
the exchange of host authentication material.
This property can accommodate authentication handshakes
that require multiple challenge-response interactions
and potentially large amounts of material.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const uint32 RDMA2_PROPID_HOSTAUTH = 6;
typedef opaque rpcrdma2_prop_hostauth<>;
]]>
</sourcecode>
<t>
When this property is not present, the peer(s) remain unauthenticated.
Local security policy on each peer determines
whether the connection is permitted to continue.
</t>
</section>

</section>

</section>

<section
 anchor="section_eef6a22e-2633-44a2-a8f0-821fec8bf824"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Transport Messages</name>
<t>
Each transport message consists of the following sections:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
A transport header prefix, as defined in
<xref target="section_2d1735f0-c465-43c6-9c18-3da6b7979862" format="default" sectionFormat="of"/>.
Among other things, this structure indicates the header type.
</li>
<li>
The transport header proper, as defined by one of the sub-sections below.
See
<xref target="section_67b34950-5376-49fd-93d7-b4fdf80d1c9b" format="default" sectionFormat="of"/>
for the mapping between header types and the corresponding header structure.
</li>
<li>
Potentially, all or part of an RPC message payload.
</li>
</ul>
<t>
This organization differs from that presented in the definition of
RPC-over-RDMA version 1
<xref target="RFC8166" format="default" sectionFormat="of"/>,
which defined the first and second of the items above as a single XDR data structure.
The new organization is in keeping with RPC-over-RDMA version 2's
extensibility model, which enables the definition of new header types
without modifying the XDR definition of existing header types.
</t>

<section
 anchor="section_67b34950-5376-49fd-93d7-b4fdf80d1c9b"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Transport Header Types</name>
<t>
<xref target="table_b5c31bf9-d623-4957-97db-29fc1d416cb8" format="default" sectionFormat="of"/>
lists the RPC-over-RDMA version 2 header types.
The columns contain the following information:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The column labeled "Operation" names
the particular operation.
</li>
<li>
The column labeled "Code" specifies
the value of the header type for this operation.
</li>
<li>
The column labeled "XDR type" gives
the XDR type of the data structure
used to organize the information in this header type.
This data immediately follows the universal portion of the
transport header, which is present in every RPC-over-RDMA transport header.
</li>
<li>
The column labeled "Msg" indicates whether this operation is
followed (or not) by an RPC message payload.
</li>
<li>
The column labeled "Section" refers to the section within the current
document that explains the use of this header type.
</li>
</ul>
<table align="left"
 anchor="table_b5c31bf9-d623-4957-97db-29fc1d416cb8">
<thead>
<tr>
<th align="left">Operation</th>
<th align="left">Code</th>
<th align="left">XDR type</th>
<th align="left">Msg</th>
<th align="left">Section</th>
</tr>
</thead>
<tbody>

<tr>
<td align="left">Report Transport Error</td>
<td align="left">4</td>
<td align="left">rpcrdma2_hdr_error</td>
<td align="left">No</td>
<td align="left">
<xref target="section_b1d23e5c-31df-483f-adb7-25430b5de38d" format="counter" sectionFormat="of"/>
</td>
</tr>

<tr>
<td align="left">Grant Credits</td>
<td align="left">5</td>
<td align="left">void</td>
<td align="left">No</td>
<td align="left">
<xref target="section_d7b171ca-326d-45ed-bfa6-eca86ae4a62e" format="counter" sectionFormat="of"/>
</td>
</tr>

<tr>
<td align="left">Specify Properties (Middle)</td>
<td align="left">6</td>
<td align="left">rpcrdma2_hdr_connprop</td>
<td align="left">No</td>
<td align="left">
<xref target="section_977e471d-6948-43da-b3c8-bbb36034bd3e" format="counter" sectionFormat="of"/>
</td>
</tr>

<tr>
<td align="left">Specify Properties (Final)</td>
<td align="left">7</td>
<td align="left">rpcrdma2_hdr_connprop</td>
<td align="left">No</td>
<td align="left">
<xref target="section_e8fbc443-1f44-4ec6-9132-9a28b4e6c870" format="counter" sectionFormat="of"/>
</td>
</tr>

<tr>
<td align="left">Convey External RPC Call Message</td>
<td align="left">8</td>
<td align="left">rpcrdma2_hdr_call_external</td>
<td align="left">No</td>
<td align="left">
<xref target="section_7e1de71a-9b68-4bd0-8213-97991139ab87" format="counter" sectionFormat="of"/>
</td>
</tr>

<tr>
<td align="left">Convey Continued RPC Call Message</td>
<td align="left">9</td>
<td align="left">rpcrdma2_hdr_call_middle</td>
<td align="left">Yes</td>
<td align="left">
<xref target="section_3eb026d2-eccb-41bc-a09e-d922d935b23c" format="counter" sectionFormat="of"/>
</td>
</tr>

<tr>
<td align="left">Convey Inline RPC Call Message</td>
<td align="left">10</td>
<td align="left">rpcrdma2_hdr_call_inline</td>
<td align="left">Yes</td>
<td align="left">
<xref target="section_b8fd93f1-e5c8-47f2-b26e-921530cb0681" format="counter" sectionFormat="of"/>
</td>
</tr>

<tr>
<td align="left">Convey External RPC Reply Message</td>
<td align="left">11</td>
<td align="left">rpcrdma2_hdr_reply_external</td>
<td align="left">No</td>
<td align="left">
<xref target="section_780fd420-96d9-4cbe-af49-141ea89cdf6F" format="counter" sectionFormat="of"/>
</td>
</tr>

<tr>
<td align="left">Convey Continued RPC Reply Message</td>
<td align="left">12</td>
<td align="left">rpcrdma2_hdr_reply_middle</td>
<td align="left">Yes</td>
<td align="left">
<xref target="section_3d6d56ae-ba9b-432f-9fc0-403a5f622fe5" format="counter" sectionFormat="of"/>
</td>
</tr>

<tr>
<td align="left">Convey Inline RPC Reply Message</td>
<td align="left">13</td>
<td align="left">rpcrdma2_hdr_reply_inline</td>
<td align="left">Yes</td>
<td align="left">
<xref target="section_67bf730f-38f1-40bc-8eec-1bedde7b0449" format="counter" sectionFormat="of"/>
</td>
</tr>

</tbody>
</table>
<t>
RPC-over-RDMA version 2 peers are
<bcp14>REQUIRED</bcp14>
to support all message header types in
<xref target="table_b5c31bf9-d623-4957-97db-29fc1d416cb8" format="default" sectionFormat="of"/>.
RPC-over-RDMA version 2 implementations that receive an unrecognized header type
<bcp14>MUST</bcp14>
respond with an RDMA2_ERROR message
with an rdma_err field containing RDMA2_ERR_INVAL_HTYPE
and drop the incoming message without processing it further.
</t>
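<t>
A receiver can derive two simple checks from the table:
whether a header type is recognized,
and whether an RPC message payload follows the header.
The C sketch below is non-normative.
The enum values are taken from the "Code" column;
the function names htype_is_known and htype_has_payload are hypothetical.
</t>
<sourcecode name="" type="c" markers="false">
<![CDATA[
#include <stdint.h>

enum {
        RDMA2_ERROR           = 4,
        RDMA2_GRANT           = 5,
        RDMA2_CONNPROP_MIDDLE = 6,
        RDMA2_CONNPROP_FINAL  = 7,
        RDMA2_CALL_EXTERNAL   = 8,
        RDMA2_CALL_MIDDLE     = 9,
        RDMA2_CALL_INLINE     = 10,
        RDMA2_REPLY_EXTERNAL  = 11,
        RDMA2_REPLY_MIDDLE    = 12,
        RDMA2_REPLY_INLINE    = 13
};

/* Return 1 if htype appears in the table, 0 otherwise. */
static int
htype_is_known(uint32_t htype)
{
        return htype >= RDMA2_ERROR && htype <= RDMA2_REPLY_INLINE;
}

/* Return 1 if the "Msg" column for htype is "Yes", 0 otherwise. */
static int
htype_has_payload(uint32_t htype)
{
        switch (htype) {
        case RDMA2_CALL_MIDDLE:
        case RDMA2_CALL_INLINE:
        case RDMA2_REPLY_MIDDLE:
        case RDMA2_REPLY_INLINE:
                return 1;
        default:
                return 0;
        }
}
]]>
</sourcecode>
<t>
When htype_is_known returns 0, the text above calls for
an RDMA2_ERROR response carrying RDMA2_ERR_INVAL_HTYPE.
</t>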
</section>

<section
 anchor="section_2e577c75-4e43-4e13-8b17-75afa849f0b6"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Headers and Chunks</name>
<t>
Most RPC-over-RDMA version 2 data structures have antecedents in
corresponding structures in RPC-over-RDMA version 1.
As is typical for new versions of an existing protocol,
the XDR data structures have new names,
and there are a few small changes in content.
In some cases,
there have been structural re-organizations
to enable protocol extensibility.
</t>

<section
 anchor="section_e21d4f74-b536-47f2-9d07-c03a27a20de4"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Common Transport Header Prefix</name>
<t>
The rpcrdma_common structure defines
the initial part of each RPC-over-RDMA transport header
for RPC-over-RDMA version 2 and subsequent versions.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
struct rpcrdma_common {
             uint32         rdma_xid;
             uint32         rdma_vers;
             uint32         rdma_credit;
             uint32         rdma_htype;
};
]]>
</sourcecode>
<t>
RPC-over-RDMA version 2's use of these first four words
aligns with that of version 1 as required by
<xref target="RFC8166" section="4.2" format="default" sectionFormat="of"/>.
However, there are crucial structural differences
in the way that the respective XDR definitions
describe these words:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The header type is represented as a uint32 rather than as an enum type.
An enum would need to be modified to reflect additions to the set of
header types made by later extensions.
</li>
<li>
The header type field is part of an XDR structure devoted to
representing the transport header prefix,
rather than being part of a discriminated union
that includes the body of each transport header type.
</li>
<li>
There is now a prefix structure
(see
<xref target="section_2d1735f0-c465-43c6-9c18-3da6b7979862" format="default" sectionFormat="of"/>)
of which the rpcrdma_common structure is the initial segment.
This prefix is a newly defined XDR object within the protocol description,
which constrains the universal portion of all header types
to the four words in rpcrdma_common.
</li>
</ul>
<t>
These changes are part of a more considerable structural change
in the XDR definition of RPC-over-RDMA version 2
that facilitates a cleaner treatment of protocol extension.
The XDR appearing in
<xref target="section_bf53e759-d97f-487d-a5e2-9b8153db1803" format="default" sectionFormat="of"/>
reflects these changes, which
<xref target="section_d945b9f0-0666-4db7-9126-be57cf7b5f4f" format="default" sectionFormat="of"/>
discusses in further detail.
</t>
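<t>
Because the header type is a plain uint32 in a fixed position,
a receiver can decode the four words of rpcrdma_common
before it knows anything about the header type that follows.
The C sketch below is non-normative.
The structure common_hdr and the function names are hypothetical;
the only encoding fact relied upon is that XDR represents a uint32
as four octets in big-endian order.
</t>
<sourcecode name="" type="c" markers="false">
<![CDATA[
#include <stddef.h>
#include <stdint.h>

/* Host-order copy of the four words in rpcrdma_common. */
struct common_hdr {
        uint32_t xid;
        uint32_t vers;
        uint32_t credit;
        uint32_t htype;
};

/* Read one XDR uint32 (four octets, most significant first). */
static uint32_t
get_be32(const uint8_t *p)
{
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Return 0 on success, or -1 if buf holds fewer than 16 octets. */
static int
common_hdr_decode(const uint8_t *buf, size_t len, struct common_hdr *out)
{
        if (len < 16)
                return -1;
        out->xid    = get_be32(buf);
        out->vers   = get_be32(buf + 4);
        out->credit = get_be32(buf + 8);
        out->htype  = get_be32(buf + 12);
        return 0;
}
]]>
</sourcecode>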
</section>

</section>

<section
 anchor="section_8039c7b8-9068-401e-9cbd-5c1e67d403e7"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Header Types</name>
<t>
The header types defined and used in RPC-over-RDMA version 1
are not carried over into RPC-over-RDMA version 2,
although there are easy equivalents to the version 1
procedures:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The RDMA2_ERROR header (defined in
<xref target="section_b1d23e5c-31df-483f-adb7-25430b5de38d" format="default" sectionFormat="of"/>)
has an XDR definition that differs from that in RPC-over-RDMA version 1,
and its modifications are all compatible extensions.
</li>
<li>
Senders use RDMA2_CALL_INLINE or RDMA2_REPLY_INLINE
(defined in Sections
<xref target="section_b8fd93f1-e5c8-47f2-b26e-921530cb0681" format="counter" sectionFormat="of"/>
and
<xref target="section_67bf730f-38f1-40bc-8eec-1bedde7b0449" format="counter" sectionFormat="of"/>)
in place of RDMA_MSG.
There are minor differences in the on-the-wire format
between the version 1 procedure and the version 2 header types.
</li>
<li>
Senders use RDMA2_CALL_EXTERNAL or RDMA2_REPLY_EXTERNAL
(defined in Sections
<xref target="section_7e1de71a-9b68-4bd0-8213-97991139ab87" format="counter" sectionFormat="of"/>
and
<xref target="section_780fd420-96d9-4cbe-af49-141ea89cdf6F" format="counter" sectionFormat="of"/>)
in place of RDMA_NOMSG.
There are minor differences in the on-the-wire format
between the version 1 procedure and the version 2 header types.
</li>
<li>
RDMA2_CONNPROP_MIDDLE and RDMA2_CONNPROP_FINAL
(defined in Sections
<xref target="section_977e471d-6948-43da-b3c8-bbb36034bd3e" format="counter" sectionFormat="of"/>
and
<xref target="section_e8fbc443-1f44-4ec6-9132-9a28b4e6c870" format="counter" sectionFormat="of"/>)
are new header types devoted to
enabling connection peers to exchange information
about their transport properties.
</li>
</ul>

<section
 anchor="section_b1d23e5c-31df-483f-adb7-25430b5de38d"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERROR: Report Transport Error</name>
<t>
RDMA2_ERROR reports a transport layer error on a previous transmission.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const rpcrdma2_proc RDMA2_ERROR = 4;

struct rpcrdma2_err_vers {
        uint32 rdma_vers_low;
        uint32 rdma_vers_high;
};

struct rpcrdma2_err_write {
        uint32 rdma_chunk_index;
        uint32 rdma_length_needed;
};

union rpcrdma2_hdr_error switch (rpcrdma2_errcode rdma_err) {
        case RDMA2_ERR_VERS:
          rpcrdma2_err_vers rdma_vrange;
        case RDMA2_ERR_READ_CHUNKS:
        case RDMA2_ERR_WRITE_CHUNKS:
          uint32 rdma_max_chunks;
        case RDMA2_ERR_SEGMENTS:
          uint32 rdma_max_segments;
        case RDMA2_ERR_WRITE_RESOURCE:
          rpcrdma2_err_write rdma_writeres;
        case RDMA2_ERR_REPLY_RESOURCE:
          uint32 rdma_length_needed;
        default:
          void;
};
]]>
</sourcecode>
<t>
See
<xref target="section_995a7597-4e89-48c3-b142-35b783ef1329" format="default" sectionFormat="of"/>
for details on the use of this header type.
</t>
</section>

<section
 anchor="section_d7b171ca-326d-45ed-bfa6-eca86ae4a62e"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_GRANT: Grant Credits</name>
<t>
The RDMA2_GRANT header type enables a connection peer
to update credit information
without conveying a payload.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const rpcrdma2_proc RDMA2_GRANT = 5;
]]>
</sourcecode>
<t>
This message carries no payload except for a struct rpcrdma2_hdr_prefix.
The rdma_xid field is unused.
Senders
<bcp14>MUST</bcp14>
set the rdma_xid field to zero and receivers
<bcp14>MUST</bcp14>
ignore the value in this field.
</t>
</section>

<section
 anchor="section_977e471d-6948-43da-b3c8-bbb36034bd3e"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_CONNPROP_MIDDLE: Exchange Transport Properties</name>
<t>
The RDMA2_CONNPROP_MIDDLE header type enables a connection peer
to publish the properties of its implementation to its
remote peer.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const rpcrdma2_proc RDMA2_CONNPROP_MIDDLE = 6;

struct rpcrdma2_hdr_connprop {
        rpcrdma2_propset rdma_props;
};
]]>
</sourcecode>
<t>
A peer sends an RDMA2_CONNPROP_MIDDLE header type
when it has one or more properties to send that
do not fit within
the default inline threshold for the RPC-over-RDMA version that is in effect.
</t>
<t>
A peer may encounter properties that it does not recognize or support.
In such cases, the receiver ignores unsupported properties
without generating an error response.
</t>
<t>
If a peer follows an RDMA2_CONNPROP_MIDDLE
header type with anything other than
another RDMA2_CONNPROP_MIDDLE message
or
an RDMA2_CONNPROP_FINAL message,
the receiver
<bcp14>MUST</bcp14>
respond with an RDMA2_ERROR header type
and
set its rdma_err field to RDMA2_ERR_INVAL_CONT
and drop the incoming message without processing it further.
</t>
</section>

<section
 anchor="section_e8fbc443-1f44-4ec6-9132-9a28b4e6c870"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_CONNPROP_FINAL: Exchange Transport Properties</name>
<t>
The RDMA2_CONNPROP_FINAL header type enables a connection peer
to publish the properties of its implementation to its
remote peer.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const rpcrdma2_proc RDMA2_CONNPROP_FINAL = 7;

struct rpcrdma2_hdr_connprop {
        rpcrdma2_propset rdma_props;
};
]]>
</sourcecode>
<t>
Each peer sends an RDMA2_CONNPROP_FINAL header type
as the final CONNPROP-type message
after the client has established a connection.
The size of this message is limited to
the default inline threshold for the RPC-over-RDMA version that is in effect.
</t>
<t>
A peer may encounter properties that it does not recognize or support.
In such cases, the receiver ignores unsupported properties
without generating an error response.
</t>
<t>
If a peer sends a CONNPROP-type message on a connection
after it has sent an RDMA2_CONNPROP_FINAL message,
the receiver
<bcp14>MUST</bcp14>
respond with an RDMA2_ERROR header type
and
set its rdma_err field to RDMA2_ERR_INVAL_CONT
and drop the incoming message without processing it further.
</t>
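<t>
The ordering rules for RDMA2_CONNPROP_MIDDLE and RDMA2_CONNPROP_FINAL
can be tracked with a small per-connection state machine on the receiving side.
The C sketch below is non-normative.
The enum connprop_state and the function connprop_next are hypothetical.
The sketch accepts a non-CONNPROP message that arrives before any CONNPROP message,
a case the two rules above do not address.
</t>
<sourcecode name="" type="c" markers="false">
<![CDATA[
#include <stdint.h>

#define RDMA2_CONNPROP_MIDDLE 6u
#define RDMA2_CONNPROP_FINAL  7u

enum connprop_state { CP_START, CP_IN_MIDDLE, CP_DONE };

/*
 * Advance the state for one received header type.
 * Returns 0 if the message is acceptable, or -1 where the text
 * calls for an RDMA2_ERROR response with RDMA2_ERR_INVAL_CONT.
 */
static int
connprop_next(enum connprop_state *st, uint32_t htype)
{
        int is_cp = htype == RDMA2_CONNPROP_MIDDLE ||
                    htype == RDMA2_CONNPROP_FINAL;

        switch (*st) {
        case CP_DONE:
                return is_cp ? -1 : 0;  /* no CONNPROP after FINAL */
        case CP_IN_MIDDLE:
                if (!is_cp)
                        return -1;      /* MIDDLE needs MIDDLE or FINAL */
                break;
        case CP_START:
                if (!is_cp)
                        return 0;       /* not covered by the rules */
                break;
        }
        *st = (htype == RDMA2_CONNPROP_FINAL) ? CP_DONE : CP_IN_MIDDLE;
        return 0;
}
]]>
</sourcecode>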
</section>

<section
 anchor="section_7e1de71a-9b68-4bd0-8213-97991139ab87"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_CALL_EXTERNAL: Convey External RPC Call Message</name>
<t>
RDMA2_CALL_EXTERNAL conveys an RPC Call message payload
using explicit RDMA operations.
The Responder reads the Payload stream from a memory area
specified by the Call chunk.
The sender
<bcp14>MUST</bcp14>
set the rdma_xid field to the same value
as the xid of the RPC Call message payload.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const rpcrdma2_proc RDMA2_CALL_EXTERNAL = 8;

struct rpcrdma2_hdr_call_external {
        uint32                      rdma_inv_handle;

        struct rpcrdma2_read_list   *rdma_call;
        struct rpcrdma2_read_list   *rdma_reads;
        struct rpcrdma2_write_list  *rdma_provisional_writes;
        struct rpcrdma2_write_chunk *rdma_provisional_reply;
};
]]>
</sourcecode>
<dl newline="false" spacing="normal">
<dt>rdma_inv_handle:</dt>
<dd>
The rdma_inv_handle field contains a 32-bit RDMA handle that
the Responder may use in a Send With Invalidation operation.
See
<xref target="section_a957db67-a8fd-4886-b7a7-57382cfe3190" format="default" sectionFormat="of"/>.
</dd>
<dt>rdma_call:</dt>
<dd>
The rdma_call field anchors a list of one or more Read segments
that contain the RPC Call's Payload stream.
</dd>
<dt>rdma_reads:</dt>
<dd>
The rdma_reads field anchors a list of zero or more Read segments
that contain data item chunks.
</dd>
<dt>rdma_provisional_writes:</dt>
<dd>
The rdma_provisional_writes field
anchors a list of zero or more provisional Write chunks.
</dd>
<dt>rdma_provisional_reply:</dt>
<dd>
The rdma_provisional_reply field is a list containing zero or one provisional Reply chunk.
</dd>
</dl>
</section>

<section
 anchor="section_3eb026d2-eccb-41bc-a09e-d922d935b23c"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_CALL_MIDDLE: Convey Continued RPC Call Message</name>
<t>
RDMA2_CALL_MIDDLE conveys a beginning or middle portion
of an RPC Call message
immediately following the transport header in the send buffer.
The sender
<bcp14>MUST</bcp14>
set the rdma_xid field to the same value
as the xid of the RPC Call message payload.
The sender sets the rdma_remaining field
to the number of bytes in the RPC Call message payload
that remain to be sent.
The rdma_rpc_first_word field demarks the first word
of the Payload stream.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const rpcrdma2_proc RDMA2_CALL_MIDDLE = 9;

struct rpcrdma2_hdr_call_middle {
        uint32                      rdma_remaining;

        /* The rpc message starts here and continues
         * through the end of the transmission. */
        uint32                      rdma_rpc_first_word;
};
]]>
</sourcecode>
<t>
If a peer follows an RDMA2_CALL_MIDDLE
header type with anything other than
an RDMA2_CALL_MIDDLE message
or
an RDMA2_CALL_INLINE message,
the receiver
<bcp14>MUST</bcp14>
respond with an RDMA2_ERROR header type
and
set its rdma_err field to RDMA2_ERR_INVAL_CONT
and drop the incoming message without processing it further.
</t>
</section>

<section
 anchor="section_b8fd93f1-e5c8-47f2-b26e-921530cb0681"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_CALL_INLINE: Convey Inline RPC Call Message</name>
<t>
RDMA2_CALL_INLINE conveys the only or final portion
of an RPC Call message.
The rdma_rpc_first_word field demarks the first word
of this Payload stream.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const rpcrdma2_proc RDMA2_CALL_INLINE = 10;

struct rpcrdma2_hdr_call_inline {
        uint32                      rdma_inv_handle;

        struct rpcrdma2_read_list   *rdma_reads;
        struct rpcrdma2_write_list  *rdma_provisional_writes;
        struct rpcrdma2_write_chunk *rdma_provisional_reply;

        /* The rpc message starts here and continues
         * through the end of the transmission. */
        uint32                      rdma_rpc_first_word;
};
]]>
</sourcecode>
<dl newline="false" spacing="normal">
<dt>rdma_inv_handle:</dt>
<dd>
The rdma_inv_handle field contains a 32-bit RDMA handle that
the Responder may use in a Send With Invalidation operation.
See
<xref target="section_a957db67-a8fd-4886-b7a7-57382cfe3190" format="default" sectionFormat="of"/>.
</dd>
<dt>rdma_reads:</dt>
<dd>
The rdma_reads field anchors a list of zero or more Read segments
that contain only data item chunks.
A Requester
<bcp14>MUST NOT</bcp14>
insert Position-zero Read chunks in this list.
</dd>
<dt>rdma_provisional_writes:</dt>
<dd>
The rdma_provisional_writes field
anchors a list of zero or more provisional Write chunks.
</dd>
<dt>rdma_provisional_reply:</dt>
<dd>
The rdma_provisional_reply field is a list containing zero or one provisional Reply chunk.
</dd>
</dl>
</section>

<section
 anchor="section_780fd420-96d9-4cbe-af49-141ea89cdf6F"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_REPLY_EXTERNAL: Convey External RPC Reply Message</name>
<t>
RDMA2_REPLY_EXTERNAL conveys an RPC Reply message payload
using explicit RDMA operations.
In particular, it is referred to as a Special Format Reply
when the Responder writes the RPC payload into a memory area
specified by a Reply chunk.
The sender
<bcp14>MUST</bcp14>
set the rdma_xid field to the same value
as the xid of the RPC Reply message payload.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const rpcrdma2_proc RDMA2_REPLY_EXTERNAL = 11;

struct rpcrdma2_hdr_reply_external {
        struct rpcrdma2_write_list  *rdma_writes;
        struct rpcrdma2_write_chunk *rdma_reply;
};
]]>
</sourcecode>
<dl newline="false" spacing="normal">
<dt>rdma_writes:</dt>
<dd>
The rdma_writes field anchors a list of zero or more Write chunks
that are either empty or contain reduced data items.
</dd>
<dt>rdma_reply:</dt>
<dd>
The rdma_reply field is a list that
<bcp14>MUST</bcp14> contain exactly one Reply chunk.
</dd>
</dl>
</section>

<section
 anchor="section_3d6d56ae-ba9b-432f-9fc0-403a5f622fe5"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_REPLY_MIDDLE: Convey Continued RPC Reply Message</name>
<t>
RDMA2_REPLY_MIDDLE conveys a beginning or middle portion
of an RPC Reply message
immediately following the transport header in the send buffer.
The sender
<bcp14>MUST</bcp14>
set the rdma_xid field to the same value
as the xid of the RPC Reply message payload.
The sender sets the rdma_remaining field
to the number of bytes in the RPC Reply message payload
that remain to be sent.
The rdma_rpc_first_word field demarks the first word
of the Payload stream.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const rpcrdma2_proc RDMA2_REPLY_MIDDLE = 12;

struct rpcrdma2_hdr_reply_middle {
        uint32                      rdma_remaining;

        /* The rpc message starts here and continues
         * through the end of the transmission. */
        uint32                      rdma_rpc_first_word;
};
]]>
</sourcecode>
<t>
If a peer follows an RDMA2_REPLY_MIDDLE
header type with anything other than
an RDMA2_REPLY_MIDDLE message
or
an RDMA2_REPLY_INLINE message,
the receiver
<bcp14>MUST</bcp14>
respond with an RDMA2_ERROR header type
and
set its rdma_err field to RDMA2_ERR_INVAL_CONT
and drop the incoming message without processing it further.
</t>
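<t>
As an illustration only,
the sequencing rule above and its counterpart for RPC Calls
can be sketched in Python as a table of permitted successors.
The function name, the table name,
and the use of strings to represent header types
are assumptions of this sketch and are not part of the protocol.
The sketch considers a single Message Continuation sequence in isolation.
</t>
<sourcecode name="" type="python" markers="false">
<![CDATA[
# Header types that may follow a MIDDLE message in one sequence.
ALLOWED_AFTER_MIDDLE = {
    "RDMA2_CALL_MIDDLE": {"RDMA2_CALL_MIDDLE", "RDMA2_CALL_INLINE"},
    "RDMA2_REPLY_MIDDLE": {"RDMA2_REPLY_MIDDLE", "RDMA2_REPLY_INLINE"},
}


def check_continuation(prev_htype, next_htype):
    """Return None if next_htype may follow prev_htype.

    Otherwise return the name of the error to report.
    Hypothetical helper: only the sequencing rule is modeled.
    """
    allowed = ALLOWED_AFTER_MIDDLE.get(prev_htype)
    if allowed is None:
        # prev_htype does not start or extend a continuation sequence.
        return None
    if next_htype in allowed:
        return None
    return "RDMA2_ERR_INVAL_CONT"
]]>
</sourcecode>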
</section>

<section
 anchor="section_67bf730f-38f1-40bc-8eec-1bedde7b0449"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_REPLY_INLINE: Convey RPC Reply Message Inline</name>
<t>
RDMA2_REPLY_INLINE conveys the only or final portion
of an RPC Reply message
immediately following the transport header in the send buffer.
If the Reply message payload has been reduced,
the rdma_writes field carries the reduced data item chunks.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
const rpcrdma2_proc RDMA2_REPLY_INLINE = 13;

struct rpcrdma2_hdr_reply_inline {
        struct rpcrdma2_write_list  *rdma_writes;

        /* The rpc message starts here and continues
         * through the end of the transmission. */
        uint32                      rdma_rpc_first_word;
};
]]>
</sourcecode>
<dl newline="false" spacing="normal">
<dt>rdma_writes:</dt>
<dd>
The rdma_writes field anchors a list of zero or more Write chunks
that are either empty or contain reduced data items.
</dd>
</dl>
</section>

</section>

<section
 anchor="section_2d1735f0-c465-43c6-9c18-3da6b7979862"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Transport Header Prefix</name>
<t>
The following prefix structure appears at the start of each
RPC-over-RDMA version 2 transport header.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
struct rpcrdma2_hdr_prefix {
        struct rpcrdma_common       rdma_start;
};
]]>
</sourcecode>
</section>

<section
 anchor="section_a957db67-a8fd-4886-b7a7-57382cfe3190"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Remote Invalidation</name>
<t>
To solicit the use of Remote Invalidation,
a Requester sets the value of the rdma_inv_handle field
in an RPC Call's transport header
to a non-zero value that matches
one of the rdma_handle fields in that header.
If the Responder may invalidate none of the rdma_handle values
in the header conveying the Call,
the Requester sets the RPC Call's rdma_inv_handle field to the value zero.
</t>
<t>
If the Responder chooses not to use Remote Invalidation
for this particular RPC Reply,
or the RPC Call's rdma_inv_handle field contains the value zero,
the Responder simply uses RDMA Send
to transmit the matching RPC Reply.
However, if the Responder chooses to use Remote Invalidation,
it uses RDMA Send With Invalidate to transmit the RPC Reply.
It <bcp14>MUST</bcp14> use the value in the corresponding Call's rdma_inv_handle field
to construct the Send With Invalidate Work Request.
</t>
<t>
A Responder never uses a Send With Invalidate Work Request when sending
a control plane header type.
This includes
the RDMA2_ERROR header type,
the RDMA2_GRANT header type,
the RDMA2_CONNPROP_MIDDLE header type,
and
the RDMA2_CONNPROP_FINAL header type.
</t>
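<t>
The Responder's choice described in this section
can be sketched in Python as follows.
This is a minimal sketch, not a definitive implementation:
the function name, the want_invalidation parameter,
and the tuple return value are assumptions made for illustration.
</t>
<sourcecode name="" type="python" markers="false">
<![CDATA[
# Control plane header types listed in this section.
CONTROL_PLANE_HTYPES = {
    "RDMA2_ERROR",
    "RDMA2_GRANT",
    "RDMA2_CONNPROP_MIDDLE",
    "RDMA2_CONNPROP_FINAL",
}


def choose_send_operation(htype, call_inv_handle, want_invalidation):
    """Return ("send", None) or ("send_with_invalidate", handle).

    htype: header type of the message the Responder is sending.
    call_inv_handle: rdma_inv_handle from the matching Call.
    want_invalidation: the Responder's own (local) preference.
    """
    if htype in CONTROL_PLANE_HTYPES:
        # Control plane messages never use Send With Invalidate.
        return ("send", None)
    if call_inv_handle == 0 or not want_invalidation:
        return ("send", None)
    # The handle comes from the Call; it is never chosen by the Responder.
    return ("send_with_invalidate", call_inv_handle)
]]>
</sourcecode>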
</section>

<section
 anchor="section_740e5b29-8c88-40ab-9506-69635d9a8167"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Payload Formats</name>
<t>
RPC-over-RDMA version 2 provides several ways,
known as "payload formats",
to convey an RPC-over-RDMA message.
A sender chooses the payload format for each message
based on several factors:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The existence of DDP-eligible data items in the RPC message payload
</li>
<li>
The size of the RPC message payload
</li>
<li>
The direction of the RPC message (i.e., Call or Reply)
</li>
<li>
The available hardware resources
</li>
<li>
The arrangement of source and sink memory buffers
</li>
</ul>
<t>
The following subsections describe in detail how
Requesters and Responders
format RPC-over-RDMA message payloads.
</t>

<section
 anchor="section_13a871a4-554f-4e3c-89ba-c5a75929f01E"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Simple Format</name>
<t>
All RPC messages conveyed via RPC-over-RDMA version 2
require at least one RDMA Send operation.
Thus, the most efficient way to send an RPC message
that is smaller than the inline threshold
is to append the Payload stream directly to the Transport stream
and use an RDMA Send to convey both.
When no chunks are present, senders construct Calls and Replies the same way,
and no other operations are needed.
</t>

<section
 anchor="section_41c2df3b-e54d-4ccf-9eac-860f44dec2d2"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Simple Format with Data Item Chunks</name>
<t>
If DDP-eligible data items are present in a Payload stream,
a sender
<bcp14>MAY</bcp14>
reduce some or all of these items,
removing them from the Payload stream.
The sender then uses a separate mechanism to transfer the reduced data items.
The Transport stream immediately followed by
the reduced Payload stream
is then transferred using one RDMA Send operation.
</t>
<t>
When data item chunks are present,
senders construct Calls differently than Replies.
</t>
<dl newline="true" spacing="normal">
<dt>Simple Call</dt>
<dd>
After receiving the Transport and Payload streams
of an RPC Call message with Read chunks,
the Responder uses RDMA Read operations to move
the reduced data items contained in the Read chunks.
RPC-over-RDMA Calls can carry Write chunks
for the Responder to use when sending the matching Reply.
</dd>
<dt>Simple Reply</dt>
<dd>
The Responder uses RDMA Write operations
to move reduced data items contained in Write chunks.
Afterward, it sends the Transport and Payload streams
of the RPC Reply message using one RDMA Send.
RPC-over-RDMA Replies always carry an empty Read chunk list.
</dd>
</dl>
</section>

<section
 anchor="section_a690cf12-0e31-4df2-a32a-8cb8e2a3b2c8"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Simple Format Examples</name>
<figure>
<name>A Simple Call without data item chunks and a Simple Reply without data item chunks</name>
<artwork
 align="left"
 alt=""
 anchor="artwork_31922ae6-4df5-489a-b424-52d72e2fe778"
 name=""
 type="call-flow">
<![CDATA[
     Requester                                 Responder
         |      RDMA Send (RDMA2_CALL_INLINE)      |
    Call |   ---------------------------------->   |
         |                                         |
         |                                         | Processing
         |                                         |
         |      RDMA Send (RDMA2_REPLY_INLINE)     |
         |   <----------------------------------   | Reply
]]>
</artwork>
</figure>
<figure>
<name>A Simple Call with a Read chunk and a Simple Reply without data item chunks</name>
<artwork
 align="left"
 alt=""
 anchor="artwork_40f8c6f0-56c5-4fb8-ac67-41522a496d79"
 name=""
 type="call-flow">
<![CDATA[
     Requester                                 Responder
         |      RDMA Send (RDMA2_CALL_INLINE)      |
    Call |   ---------------------------------->   |
         |               RDMA Read                 |
         |   <----------------------------------   |
         |         RDMA Response (arg data)        |
         |   ---------------------------------->   |
         |                                         |
         |                                         | Processing
         |                                         |
         |      RDMA Send (RDMA2_REPLY_INLINE)     |
         |   <----------------------------------   | Reply
]]>
</artwork>
</figure>
<figure>
<name>A Simple Call without data item chunks and a Simple Reply with a Write chunk</name>
<artwork
 align="left"
 alt=""
 anchor="artwork_25fddedc-8195-40ca-a33e-e8d13dd829d7"
 name=""
 type="call-flow">
<![CDATA[
     Requester                                 Responder
         |      RDMA Send (RDMA2_CALL_INLINE)      |
    Call |   ---------------------------------->   |
         |                                         |
         |                                         | Processing
         |                                         |
         |         RDMA Write (result data)        |
         |   <----------------------------------   |
         |      RDMA Send (RDMA2_REPLY_INLINE)     |
         |   <----------------------------------   | Reply
]]>
</artwork>
</figure>
</section>

</section>

<section
 anchor="section_bda7bc14-2a3d-4224-873c-855912218987"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Continued Format</name>
<t>
For various reasons,
a sender can choose to split a message payload
over multiple RPC-over-RDMA messages.
The Payload stream of each RPC-over-RDMA message
contains a part of the RPC message.
The receiver reconstructs the original RPC message
by concatenating the Payload stream
of each RPC-over-RDMA message in received order.
A sender
<bcp14>MAY</bcp14>
split the Payload stream on any convenient boundary.
</t>
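<t>
The following Python sketch illustrates
splitting a Payload stream and reconstructing it by concatenation.
It is illustrative only.
The function names and the tuple representation of a message are hypothetical.
The sketch also assumes that rdma_remaining counts the payload bytes
still unsent after the current message,
and it ignores chunk lists and the inline threshold.
</t>
<sourcecode name="" type="python" markers="false">
<![CDATA[
def split_payload(payload, max_part, kind="CALL"):
    """Split a payload into (htype, rdma_remaining, part) tuples.

    All parts except the last use the MIDDLE header type; the last
    part uses the INLINE header type. rdma_remaining is None for
    the last part because INLINE headers carry no such field.
    """
    if max_part <= 0:
        raise ValueError("max_part must be positive")
    parts = [payload[i:i + max_part]
             for i in range(0, len(payload), max_part)]
    if not parts:
        parts = [b""]
    messages = []
    sent = 0
    for part in parts[:-1]:
        sent += len(part)
        remaining = len(payload) - sent
        messages.append((f"RDMA2_{kind}_MIDDLE", remaining, part))
    messages.append((f"RDMA2_{kind}_INLINE", None, parts[-1]))
    return messages


def reassemble(messages):
    """Concatenate Payload stream parts in received order."""
    return b"".join(part for _htype, _remaining, part in messages)
]]>
</sourcecode>
<t>
For example, under these assumptions,
a 10-byte payload split into parts of at most 4 bytes
yields two MIDDLE messages followed by one INLINE message.
</t>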

<section
 anchor="section_3b102dbf-81b2-4e0f-bb4b-d238db83ab44"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Continued Format with Data Item Chunks</name>
<t>
If DDP-eligible data items are present in the Payload stream,
a sender <bcp14>MAY</bcp14> reduce some or all of these items,
removing them from the Payload stream.
The sender then uses a separate mechanism to transfer the reduced data items.
The Transport stream immediately followed by
the reduced Payload stream
is then transferred using one RDMA Send operation.
</t>
<t>
As with Simple Format messages, when chunks are present,
senders construct Calls differently than Replies.
</t>
<dl newline="true" spacing="normal">
<dt>Continued Call</dt>
<dd>
After receiving the Transport and Payload streams
of an RPC Call message with Read chunks,
the Responder uses RDMA Read operations to move
the reduced data items contained in Read chunks.
RPC-over-RDMA Calls can carry Write chunks
for the Responder to use when sending the matching Reply.
</dd>
<dt>Continued Reply</dt>
<dd>
The Responder uses RDMA Write operations
to move reduced data items contained in Write chunks.
Afterward, it sends the Transport and Payload streams
of the RPC Reply message using multiple RDMA Sends.
RPC-over-RDMA Replies always carry an empty Read chunk list.
</dd>
</dl>
</section>

<section
 anchor="section_49c71620-d55a-4867-801a-10928118befe"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Continued Format Examples</name>
<figure>
<name>A Continued Call without data item chunks and a Continued Reply without data item chunks</name>
<artwork
 align="left"
 alt=""
 anchor="artwork_c7396c9d-8613-46c0-b7d0-b5641bc1f85a"
 name=""
 type="call-flow">
<![CDATA[
     Requester                                 Responder
         |      RDMA Send (RDMA2_CALL_MIDDLE)      |
    Call |   ---------------------------------->   |
         |      RDMA Send (RDMA2_CALL_MIDDLE)      |
         |   ---------------------------------->   |
         |      RDMA Send (RDMA2_CALL_INLINE)      |
         |   ---------------------------------->   |
         |                                         |
         |                                         | Processing
         |                                         |
         |      RDMA Send (RDMA2_REPLY_MIDDLE)     |
         |   <----------------------------------   | Reply
         |      RDMA Send (RDMA2_REPLY_MIDDLE)     |
         |   <----------------------------------   |
         |      RDMA Send (RDMA2_REPLY_INLINE)     |
         |   <----------------------------------   |
]]>
</artwork>
</figure>
<figure>
<name>A Continued Call with a Read chunk and a Simple Reply without data item chunks</name>
<artwork
 align="left"
 alt=""
 anchor="artwork_4eece307-8a67-4b5b-a69d-ab0d37f9467f"
 name=""
 type="call-flow">
<![CDATA[
     Requester                                 Responder
         |      RDMA Send (RDMA2_CALL_MIDDLE)      |
    Call |   ---------------------------------->   |
         |      RDMA Send (RDMA2_CALL_MIDDLE)      |
         |   ---------------------------------->   |
         |      RDMA Send (RDMA2_CALL_INLINE)      |
         |   ---------------------------------->   |
         |              RDMA Read                  |
         |   <----------------------------------   |
         |         RDMA Response (arg data)        |
         |   ---------------------------------->   |
         |                                         |
         |                                         | Processing
         |                                         |
         |      RDMA Send (RDMA2_REPLY_INLINE)     |
         |   <----------------------------------   | Reply
]]>
</artwork>
</figure>
<figure>
<name>A Simple Call without data item chunks and a Continued Reply with a Write chunk</name>
<artwork
 align="left"
 alt=""
 anchor="artwork_4453c7eb-318c-42c8-befb-67f6a251f430"
 name=""
 type="call-flow">
<![CDATA[
     Requester                                 Responder
         |      RDMA Send (RDMA2_CALL_INLINE)      |
    Call |   ---------------------------------->   |
         |                                         |
         |                                         | Processing
         |                                         |
         |         RDMA Write (result data)        |
         |   <----------------------------------   |
         |      RDMA Send (RDMA2_REPLY_MIDDLE)     |
         |   <----------------------------------   | Reply
         |      RDMA Send (RDMA2_REPLY_MIDDLE)     |
         |   <----------------------------------   |
         |      RDMA Send (RDMA2_REPLY_INLINE)     |
         |   <----------------------------------   |
]]>
</artwork>
</figure>
</section>

</section>

<section
 anchor="section_d9731520-bdde-4a1b-9f54-9901d5c57648"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Special Format</name>
<t>
Even after DDP-eligible data items have been removed,
a Payload stream can sometimes be too large
to send using only RDMA Send operations.
In those cases, the sender can use RDMA Read or Write operations
to convey the entire RPC message.
We refer to this as a "Special Format" message.
</t>
<t>
To transmit a Special Format message,
the sender transmits only the Transport stream
with an RDMA Send operation.
The sender does not include the Payload stream in the send buffer.
Instead, the Requester provides a body chunk that the Responder
uses to move the Payload stream.
</t>
<t>
Because chunks are always present in Special Format messages,
the sender always handles Calls and Replies differently.
</t>
<dl newline="true" spacing="normal">
<dt>Special Call</dt>
<dd>
The Requester provides a Read chunk
that contains the RPC Call message's Payload stream.
Every Read segment in this chunk
<bcp14>MUST</bcp14>
contain zero (0) in its Position field.
This type of Read chunk is a body chunk known as a Call chunk.
</dd>
<dt>Special Reply</dt>
<dd>
The Requester provisions a Reply chunk in advance.
This body chunk is a Write chunk into which
the Responder places the RPC Reply message's Payload stream.
The Requester provisions the Reply chunk to accommodate
the maximum expected reply size for that upper-layer operation.
</dd>
</dl>
<t>
One purpose of a Special Format message is to handle large RPC messages.
However, Requesters
<bcp14>MAY</bcp14>
use a Special Format message at any time to convey an RPC Call message.
</t>
<t>
When it has alternatives,
a Responder chooses which Format to use based on the chunks
provided by the Requester.
If a Requester provided a Write chunk
and
the Responder has a DDP-eligible result,
it first reduces the reply Payload stream.
If a Requester provided a Reply chunk
and
the reduced Payload stream is larger than the reply inline threshold,
the Responder
<bcp14>MUST</bcp14>
use the Requester-provided Reply chunk for the reply.
</t>

<section
 anchor="section_199de5c1-e307-4664-abeb-dd687b4329c3"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Special Format Examples</name>
<figure>
<name>A Special Call and a Simple Reply without data item chunks</name>
<artwork
 align="left"
 alt=""
 anchor="anchor_fd75308e-ce7d-4670-a0b0-9ee84d8f6411"
 name=""
 type="call-flow">
<![CDATA[
     Requester                                 Responder
         |     RDMA Send (RDMA2_CALL_EXTERNAL)     |
    Call |   ---------------------------------->   |
         |               RDMA Read                 |
         |   <----------------------------------   |
         |         RDMA Response (RPC call)        |
         |   ---------------------------------->   |
         |                                         |
         |                                         | Processing
         |                                         |
         |      RDMA Send (RDMA2_REPLY_INLINE)     |
         |   <----------------------------------   | Reply
]]>
</artwork>
</figure>
<figure>
<name>A Simple Call without data item chunks and a Special Reply</name>
<artwork
 align="left"
 alt=""
 anchor="anchor_12f3b386-b806-41b4-a472-eb216266523e"
 name=""
 type="call-flow">
<![CDATA[
     Requester                                 Responder
         |      RDMA Send (RDMA2_CALL_INLINE)      |
    Call |   ---------------------------------->   |
         |                                         |
         |                                         | Processing
         |                                         |
         |          RDMA Write (RPC reply)         |
         |   <----------------------------------   |
         |     RDMA Send (RDMA2_REPLY_EXTERNAL)    |
         |   <----------------------------------   | Reply
]]>
</artwork>
</figure>
</section>

</section>

<section
 anchor="section_19fc8c94-d83f-4c1b-8ce3-700918d129b5"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Choosing a Reply Payload Format</name>
<t>
A Requester provisions all necessary registered memory resources
for both an RPC Call and its matching RPC Reply.
Because a Requester constructs each RPC Call, it can compute
the exact memory resources needed to send every Call.
However, the Requester allocates memory resources
to receive the corresponding Reply before the Responder has constructed it.
Occasionally, it is challenging for the Requester
to know in advance precisely what resources are needed to receive the Reply.
</t>
<t>
In RPC-over-RDMA version 2,
a Requester can provide a Reply chunk for any transaction.
The Responder can use the provided Reply chunk
or
it can decide to use another means to convey the RPC Reply.
If
the combination of the provided Write chunk list
and
Reply chunk
is not adequate to convey a Reply,
the Responder <bcp14>SHOULD</bcp14> use Message Continuation
to send that Reply.
If even that is not possible,
the Responder sends an RDMA2_ERROR message to the Requester,
as described in
<xref target="section_b1d23e5c-31df-483f-adb7-25430b5de38d" format="default" sectionFormat="of"/>:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
If the Write chunk list cannot accommodate the ULP's DDP-eligible data payload,
the Responder sends an RDMA2_ERR_WRITE_RESOURCE error.
</li>
<li>
If the Reply chunk cannot accommodate the parts of the Reply that are not DDP-eligible,
the Responder sends an RDMA2_ERR_REPLY_RESOURCE error.
</li>
</ul>
<t>
When receiving such errors,
the Requester can retry the ULP call using more substantial reply resources.
In cases where retrying the ULP request is not possible
(e.g., the request is non-idempotent),
the Requester terminates the RPC transaction
and presents an error to the RPC consumer.
</t>
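<t>
One possible ordering of a Responder's decisions
is sketched below in Python.
It is a sketch under stated assumptions, not a required algorithm:
the function and parameter names are hypothetical,
inline_threshold is taken to mean the number of Payload stream bytes
that fit in a single RDMA Send,
and an absent Reply chunk is treated as having no capacity.
The sketch covers only the reduced Payload stream
and does not model the RDMA2_ERR_WRITE_RESOURCE case.
</t>
<sourcecode name="" type="python" markers="false">
<![CDATA[
def choose_reply_format(reduced_len, inline_threshold,
                        reply_chunk_len, can_continue):
    """Pick a way to convey a reduced reply Payload stream.

    reduced_len: Payload stream length in bytes after reduction.
    inline_threshold: Payload stream bytes that fit in one Send.
    reply_chunk_len: Reply chunk capacity, or None if not provided.
    can_continue: True if Message Continuation can be used.
    """
    if reduced_len <= inline_threshold:
        return "Simple"
    if reply_chunk_len is not None and reduced_len <= reply_chunk_len:
        # Payload exceeds the inline threshold; use the Reply chunk.
        return "Special"
    if can_continue:
        return "Continued"
    # No remaining way to convey the non-DDP-eligible part.
    return "RDMA2_ERR_REPLY_RESOURCE"
]]>
</sourcecode>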
</section>

</section>

</section>

<section
 anchor="section_995a7597-4e89-48c3-b142-35b783ef1329"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Error Handling</name>
<t>
A receiver performs validity checks
on each ingress RPC-over-RDMA message
before it assembles that message's Payload stream
and
passes it to the RPC layer.
For example, if an ingress RPC-over-RDMA message is not as long as
the size of struct rpcrdma2_hdr_prefix (20 octets),
the receiver cannot trust the value of the rdma_xid field.
In this case, the receiver <bcp14>MUST</bcp14> silently discard the ingress message
without processing it further, and without a response to the sender.
</t>
<t>
When a request
(for instance, an RPC Call or a control plane operation)
is made,
typically an RPC consumer blocks while waiting for the response.
Thus, when an incoming message conveys a request
and that request cannot be acted upon,
the receiver of that request needs
to report the problem to its sender
in order to unblock waiters.
Likewise, if, after processing a request,
a sender is unable to transmit the response
on an otherwise healthy connection,
the sender needs to report that problem
for the same reason.
</t>
<t>
The RDMA2_ERROR header type is used for this purpose.
To form an RDMA2_ERROR type header:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The rdma_xid field <bcp14>MUST</bcp14> contain the same XID
that was in the rdma_xid field in the ingress request.
</li>
<li>
The rdma_vers field <bcp14>MUST</bcp14> contain the same version
that was in the rdma_vers field in the ingress request.
</li>
<li>
The sender sets the rdma_credit field to the credit values in effect
for this connection.
</li>
<li>
The rdma_htype field <bcp14>MUST</bcp14> contain the value RDMA2_ERROR.
</li>
<li>
The rdma_err field contains a value
that reflects the type of error that occurred,
as described in the subsections below.
</li>
</ul>
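<t>
The length check and the header construction rules above
can be sketched in Python as follows.
The sketch is illustrative only:
the function names and the dictionary representation of a header
are assumptions,
and no on-the-wire encoding is implied.
</t>
<sourcecode name="" type="python" markers="false">
<![CDATA[
# Size of struct rpcrdma2_hdr_prefix in octets, as given above.
HDR_PREFIX_LEN = 20


def is_too_short(ingress):
    """True if the message cannot hold a complete header prefix.

    Such a message is discarded without any response.
    """
    return len(ingress) < HDR_PREFIX_LEN


def build_error_header(ingress_xid, ingress_vers, credit, err):
    """Return the fields of an RDMA2_ERROR header as a dict."""
    return {
        "rdma_xid": ingress_xid,    # copied from the ingress request
        "rdma_vers": ingress_vers,  # copied from the ingress request
        "rdma_credit": credit,      # credit value now in effect
        "rdma_htype": "RDMA2_ERROR",
        "rdma_err": err,
    }
]]>
</sourcecode>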
<t>
When a peer receives an RDMA2_ERROR message type
with an unrecognized or unsupported value in its rdma_err field,
it <bcp14>MUST</bcp14> silently discard the message without processing it further.
</t>

<section
 anchor="section_c445fd74-d6c2-4f64-a215-844c84da4b6b"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Basic Transport Stream Parsing Errors</name>

<section
 anchor="section_4a5e9013-9c26-4776-afbd-95318cb2ea8a"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERR_VERS</name>
<t>
When a Responder detects an RPC-over-RDMA header version
that it does not support
(the current document defines version 2),
it <bcp14>MUST</bcp14> respond with an RDMA2_ERROR message type
and
set its rdma_err field to RDMA2_ERR_VERS.
The Responder then fills in the rpcrdma2_err_vers structure
with the RPC-over-RDMA versions it supports.
The Responder <bcp14>MUST</bcp14> silently discard the ingress message
without passing it to the RPC layer.
</t>
<t>
When a Requester receives this error message,
it uses the information in the rpcrdma2_err_vers structure
to select an RPC-over-RDMA version that both peers support
for subsequent operations on the connection.
A Requester <bcp14>MUST NOT</bcp14> subsequently send a message that uses
a version that the Responder has indicated it does not support.
RDMA2_ERR_VERS indicates a permanent error.
Receipt of this error completes the RPC transaction
associated with the XID in the rdma_xid field.
</t>
</section>

<section
 anchor="section_97a4dbec-8b72-43f8-9ebb-e822ec3ad713"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERR_VERS_MISMATCH</name>
<t>
When a Responder receives a message
with a transport protocol version
that does not match the protocol version
that was used in previous successful exchanges on the same connection,
it
<bcp14>MUST</bcp14>
respond with an RDMA2_ERROR message type
and
set its rdma_err field to RDMA2_ERR_VERS_MISMATCH.
The Responder
<bcp14>MUST</bcp14>
silently discard the ingress message without passing it to the RPC layer.
</t>
<t>
A Requester <bcp14>MUST NOT</bcp14> subsequently send a message
that uses a protocol version
that the Responder has indicated it does not recognize on this connection.
The Requester can recover by sending the message again
using a corrected protocol version,
or
it can terminate the RPC transaction associated with the XID
in the rdma_xid field with an error.
</t>
</section>

<section
 anchor="section_9754ea3a-d237-41cf-9739-99843e11e524"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERR_INVAL_HTYPE</name>
<t>
If a Responder recognizes the value in an ingress rdma_vers field,
but it does not recognize the value in the rdma_htype field
or
does not support that header type,
it <bcp14>MUST</bcp14> set the rdma_err field to RDMA2_ERR_INVAL_HTYPE.
The Responder <bcp14>MUST</bcp14> silently discard the incoming message
without passing it to the RPC layer.
</t>
<t>
A Requester <bcp14>MUST NOT</bcp14> subsequently send a message
on the connection that uses
an htype that the Responder has indicated it does not support.
RDMA2_ERR_INVAL_HTYPE indicates a permanent error.
Receipt of this error completes the RPC transaction
associated with the XID in the rdma_xid field.
</t>
</section>

<section
 anchor="section_37afd7d4-5d72-4300-ad6d-95558cdd5e19"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERR_INVAL_CONT</name>
<t>
If a Responder detects a problem with an ingress RPC-over-RDMA message
that is part of a Message Continuation sequence,
the Responder <bcp14>MUST</bcp14> set the rdma_err field to RDMA2_ERR_INVAL_CONT.
The Responder <bcp14>MUST</bcp14> silently discard all ingress messages
with an rdma_xid field that matches the failing message
without reassembling the payload.
</t>
<t>
RDMA2_ERR_INVAL_CONT indicates a permanent error.
Receipt of this error completes the RPC transaction
associated with the XID in the rdma_xid field.
</t>
</section>

</section>

<section
 anchor="section_932d8a55-30e3-412c-8b78-355606742861"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>XDR Errors</name>
<t>
A receiver might encounter an XDR parsing error that
prevents it from processing an ingress Transport stream.
Examples of such errors include:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The value of the rdma_xid field does not match
the value of the XID field in the accompanying RPC
message.
</li>
<li>
The receive buffer ends before the end of a data
object contained in the Transport stream.
</li>
</ul>
<t>
Moreover, when a Responder receives a valid RPC-over-RDMA header
but the Responder's ULP implementation cannot parse the RPC arguments
in the RPC Call,
the Responder returns an RPC Reply with status GARBAGE_ARGS,
using an RDMA2_REPLY_INLINE message type.
This type of parsing failure might be due to mismatches
between chunk sizes or offsets
and
the contents of the Payload stream, for example.
In this case, the error is permanent,
but the Requester has no way to know how much processing
the Responder has completed for this RPC transaction.
</t>

<section
 anchor="section_90449e5c-e65a-48c0-80bf-e9080d67094e"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERR_BAD_XDR</name>
<t>
If a Responder recognizes the value in the rdma_vers field,
but it cannot otherwise parse the ingress Transport stream,
it <bcp14>MUST</bcp14> set the rdma_err field to RDMA2_ERR_BAD_XDR.
The Responder <bcp14>MUST</bcp14> silently discard the ingress message
without passing it to the RPC layer.
</t>
<t>
RDMA2_ERR_BAD_XDR indicates a permanent error.
Receipt of this error completes the RPC transaction
associated with the XID in the rdma_xid field.
</t>
</section>

<section
 anchor="section_4a3d9ac4-4083-46d1-99b4-76a0b1ac93bf"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERR_BAD_PROPVAL</name>
<t>
If a receiver recognizes the value in an ingress rdma_which field,
but it cannot parse the accompanying propval,
it <bcp14>MUST</bcp14> set the rdma_err field to RDMA2_ERR_BAD_PROPVAL (see
<xref target="section_d5ac12f6-6735-48f3-b4ba-b44a19ff9298" format="default" sectionFormat="of"/>).
The receiver <bcp14>MUST</bcp14> silently discard the ingress message
without applying any of its property settings.
</t>
</section>

</section>

<section
 anchor="section_87c5f543-7092-4faf-b13e-0e994e8023a7"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Responder RDMA Operational Errors</name>
<t>
In RPC-over-RDMA version 2,
the Responder initiates RDMA Read and Write operations
that target the Requester's memory.
Problems might arise as the Responder attempts
to use Requester-provided resources for RDMA operations.
For example:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
Usually, chunks can be validated only by using their contents to perform data transfers.
If chunk contents are invalid
(e.g., a memory region is no longer registered
or
a chunk length exceeds the end of the registered memory region),
a Remote Access Error occurs.
</li>
<li>
If a Requester's Receive buffer is too small,
the Responder's Send operation completes with a Local Length Error.
</li>
<li>
If the Requester-provided Reply chunk is too small
to accommodate a large RPC Reply message,
a Remote Access Error occurs.
A Responder might detect this problem before attempting to write past the end of the Reply chunk.
</li>
</ul>
<t>
RDMA operational errors can be fatal to the connection.
To avoid a retransmission loop and repeated connection loss that deadlocks the connection,
once the Requester has re-established a connection,
the Responder
<bcp14>SHOULD</bcp14>
send an RDMA2_ERROR response
to indicate that no RPC-level reply is possible for that transaction.
</t>

<section
 anchor="section_77a9225b-7b43-4899-b9d1-5df14310a144"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERR_READ_CHUNKS</name>
<t>
If a Requester presents more DDP-eligible arguments
than a Responder is prepared to Read,
the Responder <bcp14>MUST</bcp14>
set the rdma_err field to RDMA2_ERR_READ_CHUNKS
and
set the rdma_max_chunks field
to the maximum number of Read chunks the Responder can process.
If the Responder implementation cannot handle any Read chunks for a request,
it <bcp14>MUST</bcp14> set the rdma_max_chunks field to zero in this response.
The Responder <bcp14>MUST</bcp14> silently discard the ingress message
without processing it further.
</t>
<t>
The Requester can reconstruct the Call using
Message Continuation
or
a Special Format payload
and resend it.
If the Requester chooses not to resend the Call,
it <bcp14>MUST</bcp14> terminate this RPC transaction with an error.
</t>
</section>

<section
 anchor="section_61617957-ee18-4d10-bf47-3b932739eee4"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERR_WRITE_CHUNKS</name>
<t>
If a Requester has constructed an RPC Call
with more DDP-eligible results than
the Responder is prepared to Write,
the Responder <bcp14>MUST</bcp14> set the rdma_err field to RDMA2_ERR_WRITE_CHUNKS
and
set the rdma_max_chunks field
to the maximum number of Write chunks the Responder can return.
The Requester can reconstruct the Call
with no Write chunks
and
a Reply chunk of appropriate size.
If the Requester does not resend the Call,
it <bcp14>MUST</bcp14> terminate this RPC transaction with an error.
</t>
<t>
If the Responder implementation cannot handle any Write chunks
for a request
and
cannot send the Reply using Message Continuation,
it <bcp14>MUST</bcp14> return a response of RDMA2_ERR_REPLY_RESOURCE instead (see
<xref target="section_7b2d20ad-072e-4b9c-8a16-e9a28009ae6b" format="default" sectionFormat="of"/>).
</t>
</section>

<section
 anchor="section_ff471562-5a8f-4dbd-8f37-ac41c5587b93"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERR_SEGMENTS</name>
<t>
If a Requester has constructed an RPC Call
with a chunk that contains more segments than the Responder supports,
the Responder <bcp14>MUST</bcp14>
set the rdma_err field to RDMA2_ERR_SEGMENTS
and
set the rdma_max_segments field
to the maximum number of segments the Responder can process.
The Requester can reconstruct the Call and resend it.
If the Requester does not resend the Call,
it <bcp14>MUST</bcp14> terminate this RPC transaction with an error.
</t>
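<t>
The limit checks described in this section and in the two preceding sections
can be sketched as follows.
This Python fragment is illustrative only and is not part of the protocol definition.
The limit values, the name check_chunk_limits,
the order in which the checks are applied,
and the modeling of each chunk as a list of segments
are assumptions made for the example.
</t>
<sourcecode name="" type="python" markers="false">
<![CDATA[
# Error codes as defined in the XDR for RPC-over-RDMA version 2.
RDMA2_ERR_READ_CHUNKS = 6
RDMA2_ERR_WRITE_CHUNKS = 7
RDMA2_ERR_SEGMENTS = 8

# Hypothetical Responder limits; real values are implementation-specific.
MAX_READ_CHUNKS = 4
MAX_WRITE_CHUNKS = 2
MAX_SEGMENTS = 16


def check_chunk_limits(read_chunks, write_chunks):
    """Return (rdma_err, limit) for the first limit exceeded, else None.

    Each chunk is modeled as a list of segments (an assumption).
    """
    if len(read_chunks) > MAX_READ_CHUNKS:
        return (RDMA2_ERR_READ_CHUNKS, MAX_READ_CHUNKS)
    if len(write_chunks) > MAX_WRITE_CHUNKS:
        return (RDMA2_ERR_WRITE_CHUNKS, MAX_WRITE_CHUNKS)
    for chunk in list(read_chunks) + list(write_chunks):
        if len(chunk) > MAX_SEGMENTS:
            return (RDMA2_ERR_SEGMENTS, MAX_SEGMENTS)
    return None
]]>
</sourcecode>
<t>
In this sketch, the second element of the returned pair is the value
a Responder would report in the rdma_max_chunks or rdma_max_segments field.
</t>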
</section>

<section
 anchor="section_229512aa-cd65-4d5e-b66e-65a4ed3731cc"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERR_WRITE_RESOURCE</name>
<t>
If a Requester has provided a Write chunk
that is not large enough to contain a DDP-eligible result,
the Responder <bcp14>MUST</bcp14> set the rdma_err field to RDMA2_ERR_WRITE_RESOURCE.
The Responder <bcp14>MUST</bcp14> set the rdma_chunk_index field
to point to the first Write chunk in the transport header
that is too short,
or to zero to indicate that it was not possible
to determine which chunk is too small.
Indexing starts at one (1), which represents the first Write chunk.
The Responder <bcp14>MUST</bcp14> set the rdma_length_needed field to the number of bytes
needed in that chunk to convey the result data item.
</t>
<t>
The Requester can reconstruct the Call with more reply resources
and resend it.
If the Requester does not resend the Call
(for instance, if the Responder set the index and length fields to zero),
it <bcp14>MUST</bcp14> terminate this RPC transaction with an error.
</t>
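<t>
As an illustration of the one-based indexing described above,
the following Python sketch shows one way a Responder might locate
the first Write chunk that is too small.
It is not part of the protocol definition.
The function name, the representation of chunks by their total byte capacity,
and the pairing of the i-th result with the i-th Write chunk
are assumptions made for the example.
</t>
<sourcecode name="" type="python" markers="false">
<![CDATA[
def find_short_write_chunk(chunk_capacities, result_lengths):
    """Return (rdma_chunk_index, rdma_length_needed), or None.

    chunk_capacities[i] is the total byte capacity of the i-th Write
    chunk; result_lengths[i] is the size of the result destined for it.
    """
    pairs = zip(chunk_capacities, result_lengths)
    for i, (capacity, needed) in enumerate(pairs):
        if needed > capacity:
            # Indexing starts at one for the first Write chunk.
            return (i + 1, needed)
    return None
]]>
</sourcecode>
<t>
For example, with chunk capacities of 4096 and 1024 bytes
and results of 4096 and 2048 bytes,
the sketch reports index 2 and a needed length of 2048.
</t>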
</section>

<section
 anchor="section_7b2d20ad-072e-4b9c-8a16-e9a28009ae6b"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERR_REPLY_RESOURCE</name>
<t>
If a Responder cannot send an RPC Reply
using Message Continuation
and
the Reply does not fit in the Reply chunk,
the Responder <bcp14>MUST</bcp14> set the rdma_err field to RDMA2_ERR_REPLY_RESOURCE.
The Responder <bcp14>MUST</bcp14> set the rdma_length_needed field
to the number of Reply chunk bytes needed to convey the reply.
The Requester can reconstruct the Call with more reply resources
and resend it.
If the Requester does not resend the Call
(for instance, if the Responder set the length field to zero),
it <bcp14>MUST</bcp14> terminate this RPC transaction with an error.
</t>
</section>

</section>

<section
 anchor="section_125537e3-1b73-465f-9602-c3986df297d6"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Other Operational Errors</name>
<t>
While a Requester is constructing an RPC Call message,
an unrecoverable problem might occur
that prevents the Requester from posting further RDMA Work Requests on behalf of that message.
As with other transports,
if a Requester is unable to construct and transmit an RPC Call,
the associated RPC transaction fails immediately.
</t>
<t>
After a Requester has received a Reply,
if it is unable to invalidate a memory region due to an unrecoverable problem,
the Requester <bcp14>MUST</bcp14> close the connection to protect that memory
from Responder access before the associated RPC transaction is complete.
</t>
<t>
While a Responder is constructing an RPC Reply message or error message,
an unrecoverable problem might occur that prevents the Responder
from posting further RDMA Work Requests on behalf of that message.
If a Responder is unable to construct and transmit
an RPC Reply
or
RPC-over-RDMA error message,
the Responder <bcp14>MUST</bcp14> close the connection
to signal to the Requester that a reply was lost.
</t>

<section
 anchor="section_8a40d281-66fb-4f61-a686-190624aa4001"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA2_ERR_SYSTEM</name>
<t>
If some problem occurs on a Responder
that does not fit into the above categories,
the Responder <bcp14>MAY</bcp14> report it to the Requester
by setting the rdma_err field to RDMA2_ERR_SYSTEM.
The Responder <bcp14>MUST</bcp14> silently discard the message(s)
associated with the failing transaction
without further processing.
</t>
<t>
RDMA2_ERR_SYSTEM is a permanent error.
This error does not indicate
how much of the transaction the Responder has processed,
nor does it indicate a particular recovery action for the Requester.
A Requester that receives this error <bcp14>MUST</bcp14> terminate the RPC transaction
associated with the XID value in the RDMA2_ERROR message's rdma_xid field.
</t>
</section>

</section>

<section
 anchor="section_1b7beb7d-b694-4309-95ee-f81a05637ef6"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RDMA Transport Errors</name>
<t>
The RDMA connection and physical link
provide some degree of error detection and retransmission.
Both the Marker PDU Aligned Framing (MPA) protocol (as described in
<xref target="RFC5044" section="7.1" format="default" sectionFormat="of"/>)
and the InfiniBand link layer
<xref target="IBA" format="default" sectionFormat="of"/>
provide Cyclic Redundancy Check (CRC) protection of RDMA payloads.
CRC-class protection is a general attribute of such transports.
</t>
<t>
Additionally, the RPC layer itself can accept errors
from the transport and recover via retransmission.
RPC recovery can typically handle
complete loss and re-establishment of a transport connection.
</t>
<t>
The details of reporting and recovery from RDMA link-layer errors
are described in
specific link-layer APIs and operational specifications
and are outside the scope of this protocol specification.
See
<xref target="section_912a2c09-95ec-4cb6-aa2b-2245726d9edf" format="default" sectionFormat="of"/>
for further discussion of RPC-level integrity schemes.
</t>
</section>

</section>

<section
 anchor="section_bf53e759-d97f-487d-a5e2-9b8153db1803"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>XDR Protocol Definition</name>
<t>
This section contains a description of the core features
of the RPC-over-RDMA version 2 protocol
expressed in the XDR language
<xref target="RFC4506" format="default" sectionFormat="of"/>.
It organizes the description to make it simple
to extract into a form that is
ready to compile
or
combine with similar descriptions published later
as extensions to RPC-over-RDMA version 2.
</t>

<section
 anchor="section_aaab9699-eae3-46ca-a1d5-a8776a5ecb7d"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Code Component License</name>
<t>
Code Components extracted from the current document
must include the following license text.
When combining the extracted XDR code
with other XDR code which has an identical license,
only a single copy of the license text needs to be retained.
</t>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
/// /*
///  * Copyright (c) 2010, 2020 IETF Trust and the persons
///  * identified as authors of the code.  All rights reserved.
///  *
///  * The authors of the code are:
///  * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
///  *
///  * Redistribution and use in source and binary forms, with
///  * or without modification, are permitted provided that the
///  * following conditions are met:
///  *
///  * - Redistributions of source code must retain the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer.
///  *
///  * - Redistributions in binary form must reproduce the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer in the documentation and/or other
///  *   materials provided with the distribution.
///  *
///  * - Neither the name of Internet Society, IETF or IETF
///  *   Trust, nor the names of specific contributors, may be
///  *   used to endorse or promote products derived from this
///  *   software without specific prior written permission.
///  *
///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
///  */
///
]]>
</sourcecode>
</section>

<section
 anchor="section_a288b3e6-5e73-412d-91e8-f87c031cb05b"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Extraction of the XDR Definition</name>
<t>
Implementers can apply the following sed script
to the current document to produce
a machine-readable XDR description
of the base RPC-over-RDMA version 2 protocol.
</t>
<sourcecode name="" type="" markers="true">
<![CDATA[
sed -n -e 's:^ */// ::p' -e 's:^ *///$::p'
]]>
</sourcecode>
<t>
That is, if this document is in a file called "spec.txt",
then implementers can do the following
to extract an XDR description file
and store it in the file rpcrdma-v2.x.
</t>
<sourcecode name="" type="" markers="true">
<![CDATA[
sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' \
 < spec.txt > rpcrdma-v2.x
]]>
</sourcecode>
<t>
Although this file is a usable description of the base protocol,
when extensions are to be supported,
it may be desirable to divide the description into multiple files.
The following script achieves that purpose:
</t>
<sourcecode name="" type="perl" markers="true">
<![CDATA[
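#!/usr/bin/env perl
use strict;
use warnings;

# Split rpcrdma-v2.x at each "FILE ENDS: <name>;" marker line.
open(my $in, '<', 'rpcrdma-v2.x') or die "rpcrdma-v2.x: $!";
open(my $out, '>', 'temp.x') or die "temp.x: $!";
while (my $line = <$in>) {
    # Capture only the file name, not the trailing "; */".
    if ($line =~ m/FILE ENDS: ([^;\s]+);/) {
        close($out);
        rename('temp.x', $1) or die "rename to $1: $!";
        open($out, '>', 'temp.x') or die "temp.x: $!";
    } else {
        print $out $line;
    }
}
close($in);
close($out);
# Nothing follows the final marker, so remove the empty remainder.
unlink('temp.x') if -z 'temp.x';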
]]>
</sourcecode>
<t>
Running the above script results in two files:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The file common.x, containing the license
plus the shared XDR definitions
that need to be made available
to both the base protocol and any subsequent extensions.
</li>
<li>
The file baseops.x containing the XDR definitions
for the base protocol defined in this document.
</li>
</ul>
<t>
Extensions to RPC-over-RDMA version 2,
published as Standards Track documents,
should have similarly structured XDR definitions.
Once an implementer has extracted
the XDR for all desired extensions
and
the base XDR definition contained in the current document,
she can concatenate them to produce
a consolidated XDR definition
that reflects the set of extensions
selected for her RPC-over-RDMA version 2 implementation.
</t>
<t>
Alternatively, the XDR descriptions can be compiled separately.
In that case, the combination of common.x and baseops.x
defines the base transport.
The combination of common.x and the XDR description
of each extension produces a full XDR definition of that extension.
</t>
</section>

<section
 anchor="section_b25ffcfc-511f-4383-8025-4a68cfcb4f49"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>XDR Definition for RPC-over-RDMA Version 2 Core Structures</name>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
/// /***************************************************************
///  *    Transport Header Prefixes
///  ***************************************************************/
///
/// struct rpcrdma_common {
///         uint32         rdma_xid;
///         uint32         rdma_vers;
///         uint32         rdma_credit;
///         uint32         rdma_htype;
/// };
///
/// struct rpcrdma2_hdr_prefix {
///         struct rpcrdma_common       rdma_start;
/// };
///
/// /***************************************************************
///  *    Chunks and Chunk Lists
///  ***************************************************************/
///
/// struct rpcrdma2_segment {
///         uint32 rdma_handle;
///         uint32 rdma_length;
///         uint64 rdma_offset;
/// };
///
/// struct rpcrdma2_read_segment {
///         uint32                  rdma_position;
///         struct rpcrdma2_segment rdma_target;
/// };
///
/// struct rpcrdma2_read_list {
///         struct rpcrdma2_read_segment rdma_entry;
///         struct rpcrdma2_read_list    *rdma_next;
/// };
///
/// struct rpcrdma2_write_chunk {
///         struct rpcrdma2_segment rdma_target<>;
/// };
///
/// struct rpcrdma2_write_list {
///         struct rpcrdma2_write_chunk rdma_entry;
///         struct rpcrdma2_write_list  *rdma_next;
/// };
///
/// /***************************************************************
///  *    Transport Properties
///  ***************************************************************/
///
/// /*
///  * Types for transport properties model
///  */
/// typedef uint32 rpcrdma2_propid;
///
/// struct rpcrdma2_propval {
///         rpcrdma2_propid rdma_which;
///         opaque          rdma_data<>;
/// };
///
/// typedef rpcrdma2_propval rpcrdma2_propset<>;
/// typedef uint32 rpcrdma2_propsubset<>;
///
/// /*
///  * Transport propid values for basic properties
///  */
/// const RDMA2_PROPID_SBSIZ = 1;
/// const RDMA2_PROPID_RBSIZ = 2;
/// const RDMA2_PROPID_RSSIZ = 3;
/// const RDMA2_PROPID_RCSIZ = 4;
/// const RDMA2_PROPID_BRS = 5;
/// const RDMA2_PROPID_HOSTAUTH = 6;
///
/// /*
///  * Types specific to particular properties
///  */
/// typedef uint32 rpcrdma2_prop_sbsiz;
/// typedef uint32 rpcrdma2_prop_rbsiz;
/// typedef uint32 rpcrdma2_prop_rssiz;
/// typedef uint32 rpcrdma2_prop_rcsiz;
/// typedef uint32 rpcrdma2_prop_brs;
/// typedef opaque rpcrdma2_prop_hostauth<>;
///
/// const RDMA2_RVRSDIR_NONE = 0;
/// const RDMA2_RVRSDIR_SIMPLE = 1;
/// const RDMA2_RVRSDIR_CONT = 2;
/// const RDMA2_RVRSDIR_GENL = 3;
///
/// /* FILE ENDS: common.x; */

]]>
</sourcecode>
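<t>
The list types above use XDR optional-data pointers,
so each list entry is preceded on the wire by a one-word discriminator
and the list ends with a zero word.
The following Python sketch shows how an rpcrdma2_read_list
might be encoded by hand.
It is illustrative only;
the function name and the tuple layout of its argument are assumptions.
</t>
<sourcecode name="" type="python" markers="false">
<![CDATA[
import struct


def encode_read_list(segments):
    """XDR-encode an optional rpcrdma2_read_list pointer.

    segments is an iterable of (position, handle, length, offset).
    """
    out = bytearray()
    for position, handle, length, offset in segments:
        out += struct.pack(">I", 1)      # pointer present
        # rdma_position, then rdma_handle, rdma_length, rdma_offset
        out += struct.pack(">IIIQ", position, handle, length, offset)
    out += struct.pack(">I", 0)          # end of list
    return bytes(out)
]]>
</sourcecode>
<t>
In this sketch, an empty list encodes as a single zero word,
and each entry adds 24 bytes to the encoding.
</t>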
</section>

<section
 anchor="section_84e950a5-c842-4d19-b56d-0458c3e219b2"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>XDR Definition for RPC-over-RDMA Version 2 Base Header Types</name>
<sourcecode name="" type="xdr" markers="true">
<![CDATA[
/// /***************************************************************
///  *    Descriptions of RPC-over-RDMA Header Types
///  ***************************************************************/
///
/// /*
///  * Header Type Codes: Control plane operations.
///  */
/// const RDMA2_ERROR = 4;
/// const RDMA2_GRANT = 5;
/// const RDMA2_CONNPROP_MIDDLE = 6;
/// const RDMA2_CONNPROP_FINAL = 7;
///
/// /*
///  * Header Type Codes: Call messages.
///  */
/// const RDMA2_CALL_EXTERNAL = 8;
/// const RDMA2_CALL_MIDDLE = 9;
/// const RDMA2_CALL_INLINE = 10;
///
/// /*
///  * Header Type Codes: Reply messages.
///  */
/// const RDMA2_REPLY_EXTERNAL = 11;
/// const RDMA2_REPLY_MIDDLE = 12;
/// const RDMA2_REPLY_INLINE = 13;
///
/// /*
///  * Header Type to Report Errors.
///  */
/// const RDMA2_ERR_VERS = 1;
/// const RDMA2_ERR_BAD_XDR = 2;
/// const RDMA2_ERR_BAD_PROPVAL = 3;
/// const RDMA2_ERR_INVAL_HTYPE = 4;
/// const RDMA2_ERR_INVAL_CONT = 5;
/// const RDMA2_ERR_READ_CHUNKS = 6;
/// const RDMA2_ERR_WRITE_CHUNKS = 7;
/// const RDMA2_ERR_SEGMENTS = 8;
/// const RDMA2_ERR_WRITE_RESOURCE = 9;
/// const RDMA2_ERR_REPLY_RESOURCE = 10;
/// const RDMA2_ERR_VERS_MISMATCH = 11;
/// const RDMA2_ERR_SYSTEM = 100;
///
/// struct rpcrdma2_err_vers {
///         uint32 rdma_vers_low;
///         uint32 rdma_vers_high;
/// };
///
/// struct rpcrdma2_err_write {
///         uint32 rdma_chunk_index;
///         uint32 rdma_length_needed;
/// };
///
/// union rpcrdma2_hdr_error switch (rpcrdma2_errcode rdma_err) {
///         case RDMA2_ERR_VERS:
///           rpcrdma2_err_vers rdma_vrange;
///         case RDMA2_ERR_READ_CHUNKS:
///           uint32 rdma_max_chunks;
///         case RDMA2_ERR_WRITE_CHUNKS:
///           uint32 rdma_max_chunks;
///         case RDMA2_ERR_SEGMENTS:
///           uint32 rdma_max_segments;
///         case RDMA2_ERR_WRITE_RESOURCE:
///           rpcrdma2_err_write rdma_writeres;
///         case RDMA2_ERR_REPLY_RESOURCE:
///           uint32 rdma_length_needed;
///         default:
///           void;
/// };
///
/// /*
///  * Header Type to Exchange Transport Properties.
///  */
/// struct rpcrdma2_hdr_connprop {
///         rpcrdma2_propset rdma_props;
/// };
///
/// /*
///  * Header Types to Convey RPC Messages.
///  */
/// struct rpcrdma2_hdr_call_external {
///         uint32                      rdma_inv_handle;
///
///         struct rpcrdma2_read_list   *rdma_call;
///         struct rpcrdma2_read_list   *rdma_reads;
///         struct rpcrdma2_write_list  *rdma_provisional_writes;
///         struct rpcrdma2_write_chunk *rdma_provisional_reply;
/// };
///
/// struct rpcrdma2_hdr_call_middle {
///         uint32                      rdma_remaining;
///
///         /* The rpc message starts here and continues
///          * through the end of the transmission. */
///         uint32                      rdma_rpc_first_word;
/// };
///
/// struct rpcrdma2_hdr_call_inline {
///         uint32                      rdma_inv_handle;
///
///         struct rpcrdma2_read_list   *rdma_reads;
///         struct rpcrdma2_write_list  *rdma_provisional_writes;
///         struct rpcrdma2_write_chunk *rdma_provisional_reply;
///
///         /* The rpc message starts here and continues
///          * through the end of the transmission. */
///         uint32                      rdma_rpc_first_word;
/// };
///
/// struct rpcrdma2_hdr_reply_external {
///         struct rpcrdma2_write_list  *rdma_writes;
///         struct rpcrdma2_write_chunk *rdma_reply;
/// };
///
/// struct rpcrdma2_hdr_reply_middle {
///         uint32                      rdma_remaining;
///
///         /* The rpc message starts here and continues
///          * through the end of the transmission. */
///         uint32                      rdma_rpc_first_word;
/// };
///
/// struct rpcrdma2_hdr_reply_inline {
///         struct rpcrdma2_write_list  *rdma_writes;
///
///         /* The rpc message starts here and continues
///          * through the end of the transmission. */
///         uint32                      rdma_rpc_first_word;
/// };
///
/// /* FILE ENDS: baseops.x; */

]]>
</sourcecode>
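<t>
To illustrate how the header prefix and the rpcrdma2_hdr_error union fit together,
the following Python sketch decodes an RDMA2_ERROR message by hand.
It is not part of the protocol definition.
The function name and the returned dictionary keys are assumptions,
as is the premise that the error body immediately follows the
four-word rpcrdma_common prefix.
</t>
<sourcecode name="" type="python" markers="false">
<![CDATA[
import struct

RDMA2_ERROR = 4
RDMA2_ERR_VERS = 1
RDMA2_ERR_READ_CHUNKS = 6
RDMA2_ERR_WRITE_CHUNKS = 7
RDMA2_ERR_SEGMENTS = 8
RDMA2_ERR_WRITE_RESOURCE = 9
RDMA2_ERR_REPLY_RESOURCE = 10


def decode_error(msg):
    """Decode rpcrdma_common followed by rpcrdma2_hdr_error."""
    xid, vers, credit, htype = struct.unpack_from(">IIII", msg, 0)
    if htype != RDMA2_ERROR:
        raise ValueError("not an RDMA2_ERROR message")
    (err,) = struct.unpack_from(">I", msg, 16)
    out = {"xid": xid, "err": err}
    if err == RDMA2_ERR_VERS:
        low, high = struct.unpack_from(">II", msg, 20)
        out.update(vers_low=low, vers_high=high)
    elif err in (RDMA2_ERR_READ_CHUNKS, RDMA2_ERR_WRITE_CHUNKS):
        (out["max_chunks"],) = struct.unpack_from(">I", msg, 20)
    elif err == RDMA2_ERR_SEGMENTS:
        (out["max_segments"],) = struct.unpack_from(">I", msg, 20)
    elif err == RDMA2_ERR_WRITE_RESOURCE:
        index, needed = struct.unpack_from(">II", msg, 20)
        out.update(chunk_index=index, length_needed=needed)
    elif err == RDMA2_ERR_REPLY_RESOURCE:
        (out["length_needed"],) = struct.unpack_from(">I", msg, 20)
    # All other error codes carry no additional data (void arm).
    return out
]]>
</sourcecode>
<t>
A Requester could use the decoded xid to find the affected RPC transaction
and the remaining fields to decide whether to reconstruct and resend the Call.
</t>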
</section>

<section
 anchor="section_5541f0da-efbb-4431-af9c-6f82aa773963"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Use of the XDR Description</name>
<t>
The files common.x and baseops.x,
when combined with the XDR descriptions for extensions defined later,
produce a human-readable and compilable description
of the RPC-over-RDMA version 2 protocol with the included extensions.
</t>
<t>
Although this XDR description can generate encoders and decoders
for the Transport and Payload streams,
there are elements of the operation of RPC-over-RDMA version 2
that cannot be expressed within the XDR language.
Implementations that use the output of an automated XDR processor
need to provide additional code to bridge these gaps.
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The Transport stream is not a single XDR object.
Instead, the header prefix is one XDR data item,
and the rest of the header is a separate XDR data item.
<xref target="table_b5c31bf9-d623-4957-97db-29fc1d416cb8" format="default" sectionFormat="of"/>
expresses the mapping between the header type in the header prefix
and
the XDR object representing the header type.
</li>
<li>
The relationship between
the Transport stream
and
the Payload stream
is not specified using XDR.
Comments within the XDR text
make clear where transported messages,
described by their own XDR definitions,
need to appear.
Such data is opaque to the transport.
</li>
<li>
Continuation of RPC messages across transport message boundaries
requires that message assembly facilities
not specifiable within XDR are part of transport implementations.
</li>
<li>
Transport properties are constant integer values.
<xref target="table_99d0e7cc-da81-4f16-9bd0-471f806bc0b6" format="default" sectionFormat="of"/>
expresses the mapping between
each property's code point
and
the XDR typedef that represents the structure of the property's value.
XDR does not possess the facility to express that mapping in an extensible way.
</li>
</ul>
<t>
The role of XDR in RPC-over-RDMA specifications
is more limited than for protocols
where the totality of the protocol is expressible within XDR.
XDR lacks the facility to represent
the embedding of XDR-encoded payload material.
Also, the need to cleanly accommodate extensions
has meant that those using rpcgen in their applications
need to take an active role to provide
the facilities that cannot be expressed within XDR.
</t>
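<t>
One way an implementation might supply the mapping
between property code points and value types
is a lookup table maintained outside the XDR description.
The following Python sketch is illustrative only.
The function and table names are assumptions.
The sketch also assumes that rdma_data carries the XDR encoding of the property's value
and that an unrecognized code point is reported to the caller as None.
</t>
<sourcecode name="" type="python" markers="false">
<![CDATA[
import struct

RDMA2_PROPID_SBSIZ = 1
RDMA2_PROPID_RBSIZ = 2
RDMA2_PROPID_RSSIZ = 3
RDMA2_PROPID_RCSIZ = 4
RDMA2_PROPID_BRS = 5
RDMA2_PROPID_HOSTAUTH = 6


def _decode_uint32(data):
    if len(data) != 4:
        raise ValueError("bad length for uint32 property")
    return struct.unpack(">I", data)[0]


def _decode_opaque(data):
    # Variable-length opaque: length word, data, zero pad to 4 bytes.
    if len(data) < 4:
        raise ValueError("truncated opaque property")
    (n,) = struct.unpack_from(">I", data, 0)
    if len(data) != 4 + ((n + 3) & ~3):
        raise ValueError("bad length for opaque property")
    return data[4:4 + n]


# Code point to decoder; extensions would add entries here.
PROP_DECODERS = {
    RDMA2_PROPID_SBSIZ: _decode_uint32,
    RDMA2_PROPID_RBSIZ: _decode_uint32,
    RDMA2_PROPID_RSSIZ: _decode_uint32,
    RDMA2_PROPID_RCSIZ: _decode_uint32,
    RDMA2_PROPID_BRS: _decode_uint32,
    RDMA2_PROPID_HOSTAUTH: _decode_opaque,
}


def decode_propval(which, data):
    """Return the decoded value, or None if 'which' is unrecognized."""
    decoder = PROP_DECODERS.get(which)
    if decoder is None:
        return None
    # A ValueError here corresponds to RDMA2_ERR_BAD_PROPVAL.
    return decoder(data)
]]>
</sourcecode>
<t>
Keeping the table separate from the generated XDR code
allows an extension to register new code points
without changing the base protocol's decoders.
</t>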
</section>

</section>

<section
 anchor="section_e914de0a-05f3-4e14-a067-fb49a4f9b0ad"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RPC Bind Parameters</name>
<t>
Before establishing a new connection,
an RPC client obtains a transport address for the RPC server.
The means used to obtain this address
and to open an RDMA connection is dependent
on the type of RDMA transport
and is the responsibility of each RPC protocol binding
and its local implementation.
</t>
<t>
RPC services typically register with a portmap or rpcbind service
<xref target="RFC1833" format="default" sectionFormat="of"/>,
which associates an RPC Program number with a service address.
This policy is no different with RDMA transports.
However, a distinct service address (port number)
is sometimes required for operation on RPC-over-RDMA.
</t>
<t>
When mapped atop MPA
<xref target="RFC5044" format="default" sectionFormat="of"/>,
which uses IP port addressing due to its layering on TCP or SCTP,
port mapping is trivial
and consists merely of issuing the port in the connection process.
The NFS/RDMA protocol service address
has been assigned port 20049 by IANA
for this deployment scenario
<xref target="RFC8267" format="default" sectionFormat="of"/>.
</t>
<t>
When mapped atop InfiniBand
<xref target="IBA" format="default" sectionFormat="of"/>,
which uses a service endpoint naming scheme based on a Global Identifier (GID),
a translation <bcp14>MUST</bcp14> be employed.
One such translation is described in
Annexes A3 (Application Specific Identifiers),
A4 (Sockets Direct Protocol (SDP)),
and A11 (RDMA IP CM Service) of
<xref target="IBA" format="default" sectionFormat="of"/>,
which is appropriate for translating IP port addressing
to the InfiniBand network.
Therefore, in this case,
IP port addressing may be readily employed by the upper layer.
</t>
<t>
When a mapping standard or convention exists
for IP ports on an RDMA interconnect,
there are several possibilities for each upper layer to consider:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
One possibility is to have the server register
its mapped IP port with the rpcbind service
under the netid (or netids) defined in
<xref target="RFC8166" format="default" sectionFormat="of"/>.
An RPC-over-RDMA-aware RPC client can then
resolve its desired service to a mappable port
and
proceed to connect.
This method is the most flexible and compatible approach
for those upper layers that are defined to use the rpcbind service.
</li>
<li>
A second possibility is to have the RPC server's portmapper
register itself on the RDMA interconnect at a "well-known" service address
(on UDP or TCP, this corresponds to port 111).
An RPC client can connect to this service address
and
use the portmap protocol to obtain a service address
in response to a program number
(e.g., a TCP port number or an InfiniBand GID).
</li>
<li>
Alternatively, an RPC client can connect
to the mapped well-known port for the service itself,
if it is appropriately defined.
By convention, the NFS/RDMA service,
when operating atop an InfiniBand fabric,
uses the same 20049 assignment as for MPA.
</li>
</ul>
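<t>
As an illustration of the first possibility,
a server that registers with rpcbind supplies its address
as a universal address string, in which the port number
appears as two trailing decimal octets.
The following Python sketch formats such a string for an IPv4 address.
It is illustrative only; the function name is an assumption,
and 192.0.2.1 is a documentation address.
</t>
<sourcecode name="" type="python" markers="false">
<![CDATA[
def ipv4_uaddr(host, port):
    """Format an IPv4 universal address string (h1.h2.h3.h4.p1.p2)."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port out of range")
    # The port is split into its high-order and low-order octets.
    return "%s.%d.%d" % (host, port >> 8, port & 0xFF)


# NFS/RDMA port 20049 at a documentation address.
example = ipv4_uaddr("192.0.2.1", 20049)
]]>
</sourcecode>
<t>
For port 20049, the two trailing octets are 78 and 81,
so the example yields 192.0.2.1.78.81.
</t>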
<t>
Historically, different RPC protocols have taken different approaches
to their port assignments.
The current document leaves the choice of method to each RPC-over-RDMA-enabled ULB.
</t>
<t>
<xref target="RFC8166" format="default" sectionFormat="of"/>
defines two new netid values
to be used for registration of upper layers atop MPA
and (when a suitable port translation service is available) InfiniBand.
Additional RDMA-capable networks
<bcp14>MAY</bcp14>
define their own netids, or, if they provide a port translation, they
<bcp14>MAY</bcp14>
share those defined in
<xref target="RFC8166" format="default" sectionFormat="of"/>.
</t>
</section>

<section
 anchor="section_fd2f45e7-85fa-4863-a4cd-ea200878062f"
 numbered="true"
 removeInRFC="true"
 toc="default">
<name>Implementation Status</name>
<t>
This section records the status of known implementations of the protocol
defined by this specification at the time of posting of this Internet-Draft,
and is based on a proposal described in
<xref target="RFC7942" format="default" sectionFormat="of"/>.
The description of implementations in this section is intended to
assist the IETF in its decision processes in progressing drafts to RFCs.
</t>
<t>
Please note that the listing of any individual implementation here
does not imply endorsement by the IETF.
Furthermore, no effort has been spent to verify the information presented here
that was supplied by IETF contributors.
This is not intended as, and must not be construed to be,
a catalog of available implementations or their features.
Readers are advised to note that other implementations may exist.
</t>
<t>
At this time, no known implementations of the protocol
described in the current document exist.
</t>
</section>

<section
 anchor="section_912a2c09-95ec-4cb6-aa2b-2245726d9edf"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Security Considerations</name>

<section
 anchor="section_9e0f4573-2f3e-4758-9a9d-c5ae8f54d5f6"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Memory Protection</name>
<t>
A primary consideration is the protection of the integrity and confidentiality
of host memory by an RPC-over-RDMA transport.
The use of an RPC-over-RDMA transport protocol <bcp14>MUST NOT</bcp14> introduce vulnerabilities
to system memory contents nor memory owned by user processes.
Any RDMA provider used for RPC transport <bcp14>MUST</bcp14> conform
to the requirements of
<xref target="RFC5042" format="default" sectionFormat="of"/>
to satisfy these protections.
</t>

<section
 anchor="section_2ca69bb1-1d40-4226-a375-face55cf0108"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Protection Domains</name>
<t>
The use of a Protection Domain to limit the exposure of memory regions
to a single connection is critical.
Any attempt by an endpoint not participating in that connection
to reuse memory handles needs to result in immediate failure of that connection.
Because ULP security mechanisms rely
on this aspect of Reliable Connected behavior,
implementations <bcp14>SHOULD</bcp14> cryptographically authenticate connection endpoints.
</t>
</section>

<section
 anchor="section_35460fd7-b8e5-4c95-901c-fbe827b61966"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Handle (STag) Predictability</name>
<t>
Implementations should use unpredictable memory handles
for any operation requiring exposed memory regions.
Exposing a continuously registered memory region
allows a remote host to read or write to that region
even when an RPC involving that memory is not underway.
Therefore, implementations should avoid
the use of persistently registered memory.
</t>
</section>

<section
 anchor="section_958d6c42-3fa3-4368-9ea9-2f43b8795bcb"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Memory Registration and Invalidation</name>
<t>
Requesters should register memory regions for remote access
only when they are about to be the target of an RPC transaction
that involves an RDMA Read or Write.
</t>
<t>
Requesters should invalidate memory regions
as soon as related RPC operations are complete.
Invalidation and DMA unmapping of memory regions should complete
before the receiver checks message integrity,
and
before the RPC consumer can use or alter the contents of the exposed memory region.
</t>
<t>
An RPC transaction on a Requester can terminate
before a Reply arrives, for example,
if the RPC consumer is signaled, or a segmentation fault occurs.
When an RPC terminates abnormally, memory regions associated with that RPC
should be invalidated before the Requester reuses those regions for other purposes.
</t>
</section>

<section
 anchor="section_71519a0b-0458-4ae8-912d-2f09a968ab09"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Denial of Service</name>
<t>
A detailed discussion of denial-of-service exposures
that can result from the use of an RDMA transport
appears in
<xref target="RFC5042" section="6.4" format="default" sectionFormat="of"/>.
</t>
<t>
A Responder is not obliged to pull unreasonably large Read chunks.
A Responder can use an RDMA2_ERROR response
to terminate RPCs with unreadable Read chunks.
If a Responder transmits more data than
a Requester is prepared to receive in a Write or Reply chunk,
the RDMA provider typically terminates the connection.
For further discussion, see
<xref target="section_b1d23e5c-31df-483f-adb7-25430b5de38d" format="default" sectionFormat="of"/>.
Such repeated connection termination can deny service
to other users sharing the connection from the errant Requester.
</t>
<t>
An RPC-over-RDMA transport implementation is not responsible
for throttling the RPC request rate,
other than to keep the number of concurrent RPC transactions
within the per-connection credit limits (see
<xref target="section_45c67eb8-8dc6-47c3-8555-14270f1514bF" format="default" sectionFormat="of"/>).
A sender can trigger a self-denial of service
by exceeding the credit limit repeatedly.
</t>
<t>
When an RPC transaction terminates due to
a signal
or
premature exit of an application process,
a Requester should invalidate the RPC's Write and Reply chunks.
Invalidation prevents the subsequent arrival of the Responder's Reply
from altering the memory regions associated with those chunks
after the Requester has released that memory.
</t>
<t>
On the Requester,
a malfunctioning application
or
a malicious user
can create a situation where RPCs initiate and abort continuously,
resulting in Responder replies that terminate
the underlying RPC-over-RDMA connection repeatedly.
Such situations can deny service to other users
sharing the connection from that Requester.
</t>
</section>

</section>

<section
 anchor="section_4b069dfd-7532-4b9b-a9c9-1f0e8ee0d2fC"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RPC Message Security</name>
<t>
ONC RPC provides cryptographic security via the RPCSEC_GSS framework
<xref target="RFC7861" format="default" sectionFormat="of"/>.
RPCSEC_GSS implements
message authentication (rpc_gss_svc_none),
per-message integrity checking (rpc_gss_svc_integrity),
and
per-message confidentiality (rpc_gss_svc_privacy)
in a layer above the RPC-over-RDMA transport.
The integrity and privacy services require
significant computation and movement of data
on each endpoint host.
As a result, some of the performance benefits of RDMA transports can be lost.
</t>

<section
 anchor="section_ca56ff24-b218-455a-9faf-c8f7c17bf26c"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RPC-over-RDMA Protection at Other Layers</name>
<t>
For any RPC transport,
utilizing RPCSEC_GSS integrity or privacy services
has performance implications.
Protection below the RPC implementation is often
a better choice in performance-sensitive deployments,
especially if it, too, can be offloaded.
Certain implementations of IPsec can be co-located in RDMA hardware,
for example,
without change to RDMA consumers
and
with little loss of data movement efficiency.
Such arrangements can also provide a higher degree of privacy
by hiding endpoint identity
or
altering the frequency at which messages are exchanged,
at a performance cost.
</t>
<t>
Implementations <bcp14>MAY</bcp14> negotiate the use of protection
in another layer through the use of
an RPCSEC_GSS security flavor defined in
<xref target="RFC7861" format="default" sectionFormat="of"/>
in conjunction with
the Channel Binding mechanism
<xref target="RFC5056" format="default" sectionFormat="of"/>
and
IPsec Channel Connection Latching
<xref target="RFC5660" format="default" sectionFormat="of"/>.
</t>
</section>

<section
 anchor="section_61deca2d-94c4-4fd4-be7a-59ae3a77c9a2"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RPCSEC_GSS on RPC-over-RDMA Transports</name>
<t>
Not all RDMA devices and fabrics support the above protection mechanisms.
Also, NFS clients, where multiple users can access NFS files,
still require per-message authentication.
In these cases, RPCSEC_GSS can protect NFS traffic conveyed on RPC-over-RDMA connections.
</t>
<t>
RPCSEC_GSS extends the ONC RPC protocol without changing the format of RPC messages.
By observing the conventions described in this section,
an RPC-over-RDMA transport can convey RPCSEC_GSS-protected RPC messages interoperably.
</t>
<t>
Senders <bcp14>MUST NOT</bcp14> reduce protocol elements of RPCSEC_GSS
that appear in the Payload stream of an RPC-over-RDMA message.
Such elements include control messages
exchanged as part of establishing or destroying a security context,
or data items that are part of RPCSEC_GSS authentication material.
</t>

<section
 anchor="section_17014ff8-d5ef-4db8-bbb5-337a07cd66e2"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RPCSEC_GSS Context Negotiation</name>
<t>
Some NFS client implementations use a separate connection
to establish a Generic Security Service (GSS) context for NFS operation.
Such clients use TCP and the standard NFS port (2049) for context establishment.
Therefore, an NFS server <bcp14>MUST</bcp14> also provide
a TCP-based NFS service on port 2049 to enable the use of RPCSEC_GSS with NFS/RDMA.
</t>
</section>

<section
 anchor="section_77AA4781-E811-4B0D-8704-F96CCD4888DF"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RPC-over-RDMA with RPCSEC_GSS Authentication</name>
<t>
The RPCSEC_GSS authentication service has no impact
on the DDP-eligibility of data items in a ULP.
</t>
<t>
However, RPCSEC_GSS authentication material
appearing in an RPC message header can be larger than, say,
an AUTH_SYS authenticator.
In particular, when an RPCSEC_GSS pseudoflavor is in use,
a Requester needs to accommodate a larger RPC credential
when marshaling RPC Calls
and
needs to provide for a maximum size RPCSEC_GSS verifier
when allocating reply buffers and Reply chunks.
</t>
<t>
RPC messages, and thus Payload streams,
are larger on average as a result.
ULP operations that fit in a Simple Format message
when a simpler form of authentication is in use
might need to be reduced or conveyed via a Special Format message
when RPCSEC_GSS authentication is in use.
It is therefore more likely that a Requester provisions
both a Read list and a Reply chunk
in the same RPC-over-RDMA Transport header
to convey a Special Format Call
and to provision a receptacle for a Special Format Reply.
</t>
<t>
In addition to this cost,
the XDR encoding and decoding of each RPC message
using RPCSEC_GSS authentication
requires per-message host compute resources
to construct the GSS verifier.
</t>
</section>

<section
 anchor="section_4574ad52-fe73-4679-a808-ae3612c60f24"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>RPC-over-RDMA with RPCSEC_GSS Integrity or Privacy</name>
<t>
The RPCSEC_GSS integrity service enables endpoints
to detect the modification of RPC messages in flight.
The RPCSEC_GSS privacy service prevents
all but the intended recipient
from viewing the cleartext content of RPC arguments and results.
RPCSEC_GSS integrity and privacy services are end-to-end.
They protect RPC arguments and results from application to server endpoint, and back.
</t>
<t>
The RPCSEC_GSS integrity and encryption services operate
on whole RPC messages after they have been XDR encoded,
and before they have been XDR decoded after receipt.
Connection endpoints use intermediate buffers
to prevent exposure of encrypted
or
unverified cleartext data to RPC consumers.
After a sender has
verified,
encrypted,
and
wrapped a message,
the transport layer <bcp14>MAY</bcp14> use RDMA data transfer
between these intermediate buffers.
</t>
<t>
The process of reducing a DDP-eligible data item removes
the data item
and
its XDR padding
from an encoded Payload stream.
In a non-protected RPC-over-RDMA message,
a reduced data item does not include XDR padding.
After reduction, the Payload stream contains fewer octets
than the whole XDR stream did beforehand.
XDR padding octets are often zero bytes, but they are not required to be.
Thus, reducing DDP-eligible items affects
the result of message integrity verification and encryption.
</t>
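<t>
As a non-normative illustration of the octet counts involved,
the following C sketch computes the XDR padding
for an opaque data item of a given length
(XDR rounds such items up to a multiple of four octets; see
<xref target="RFC4506" format="default" sectionFormat="of"/>)
and the number of octets that reduction removes from the Payload stream.
The function names xdr_pad_octets and reduced_octets are hypothetical
and do not appear elsewhere in this document.
</t>
<sourcecode type="c"><![CDATA[
#include <stdint.h>

/* Number of padding octets XDR appends to an opaque item of
 * "len" octets so that it ends on a four-octet boundary. */
static uint32_t xdr_pad_octets(uint32_t len)
{
    return (4u - (len & 3u)) & 3u;
}

/* Octets removed from the Payload stream when an item of "len"
 * octets is reduced: the item content plus its XDR padding.
 * A 64-bit result avoids overflow for lengths near UINT32_MAX. */
static uint64_t reduced_octets(uint32_t len)
{
    return (uint64_t)len + xdr_pad_octets(len);
}
]]></sourcecode>
<t>
For example, reducing a 5-octet data item removes 8 octets
(5 octets of content and 3 octets of padding) from the Payload stream,
so a checksum or ciphertext computed over the reduced stream
differs from one computed over the whole XDR stream.
</t>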
<t>
Therefore, a sender <bcp14>MUST NOT</bcp14> reduce a Payload stream
when RPCSEC_GSS integrity or encryption services are in use.
Effectively, no data item is DDP-eligible in this situation.
Senders can use only Simple and Continued Formats without data item chunks,
or Special Format.
In this mode, an RPC-over-RDMA transport operates
in the same manner as a transport that does not support DDP.
</t>
</section>

<section
 anchor="section_0cddd345-064a-4c96-b251-17afce70219f"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Protecting RPC-over-RDMA Transport Headers</name>
<t>
RPCSEC_GSS does not protect the RPC-over-RDMA Transport stream,
just as it does not protect the header fields in an RPC message
(e.g., the xid and mtype fields).
XIDs, connection credit limits, and chunk lists
(though not the content of the data items they refer to)
are exposed to malicious behavior,
which can redirect data that is transferred by the RPC-over-RDMA message,
result in spurious retransmits,
or
trigger connection loss.
</t>
<t>
In particular,
if an attacker alters the information
contained in the chunk lists of an RPC-over-RDMA Transport header,
data contained in those chunks can be redirected
to other registered memory regions on Requesters.
An attacker might alter the arguments
of RDMA Read and RDMA Write operations on the wire
to gain a similar effect.
If such alterations occur,
the use of RPCSEC_GSS integrity or privacy services
enables a Requester to detect unexpected material in a received RPC message.
</t>
<t>
Encryption at other layers, as described in
<xref target="section_ca56ff24-b218-455a-9faf-c8f7c17bf26c" format="default" sectionFormat="of"/>,
protects the content of the Transport stream.
RDMA transport implementations should conform to
<xref target="RFC5042" format="default" sectionFormat="of"/>
to address attacks on RDMA protocols themselves.
</t>
</section>

</section>

</section>

<section
 anchor="section_3b0e673b-98d7-436d-bd6f-180180503df6"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Transport Properties</name>
<t>
Like other fields that appear in the Transport stream,
transport properties are sent in the clear with no integrity protection,
making them vulnerable to man-in-the-middle attacks.
</t>
<t>
For example, if a man-in-the-middle were to change the value
of the Receive buffer size, it could
reduce connection performance
or
trigger loss of connection.
Repeated connection loss can impact performance
or
even prevent a new connection from being established.
The recourse is to
deploy on a private network
or
use transport layer encryption.
</t>
</section>

<section
 anchor="section_c85be87e-4f2b-4caf-8ae5-acdaa972a9f9"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Host Authentication</name>
<t>
[ cel: This subsection is unfinished. ]
</t>
<t>
Wherein we use the relevant sections of
<xref target="RFC3552" format="default" sectionFormat="of"/>
to analyze the addition of host authentication
to this RPC-over-RDMA transport.
</t>
<t>
The authors refer readers to Appendix C of
<xref target="RFC8446" format="default" sectionFormat="of"/>
for information on how to design and test
a secure authentication handshake implementation.
</t>
</section>

</section>

<section
 anchor="section_d235c884-6463-411f-ba34-6bcc82ab7a9f"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>IANA Considerations</name>
<t>
The RPC-over-RDMA family of transports has been assigned RPC netids
by
<xref target="RFC8166" format="default" sectionFormat="of"/>.
A netid is an rpcbind
<xref target="RFC1833" format="default" sectionFormat="of"/>
string used to identify the underlying protocol
in order for RPC to select appropriate transport framing
and the format of the service addresses and ports.
</t>
<t>The following netid registry strings are already defined for this purpose:
</t>
<artwork
 align="left"
 alt=""
 anchor="artwork_bfbf9434-d33a-4f66-a0bf-2fdee624155d"
 name=""
 type="">
<![CDATA[
   NC_RDMA "rdma"
   NC_RDMA6 "rdma6"
]]>
</artwork>
<t>
The "rdma" netid is to be used when IPv4 addressing is employed by the underlying transport,
and "rdma6" when IPv6 addressing is employed.
The netid assignment policy and registry are defined in
<xref target="RFC5665" format="default" sectionFormat="of"/>.
The current document does not alter these netid assignments.
</t>
<t>
These netids <bcp14>MAY</bcp14> be used for any RDMA network that satisfies the
requirements of
<xref target="section_6903045e-bd1c-4e12-bf96-6b534989f46A" format="default" sectionFormat="of"/>
and that is able to identify service endpoints using IP port addressing,
possibly through use of a translation service as described in
<xref target="section_e914de0a-05f3-4e14-a067-fb49a4f9b0ad" format="default" sectionFormat="of"/>.
</t>
</section>

</middle>

<back>

<references>
<name>References</name>

<references>
<name>Normative References</name>

<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.1833.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.4506.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5042.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5056.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5531.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5660.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5665.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7861.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7942.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8166.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8267.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8446.xml"/>

</references>

<references>
<name>Informative References</name>

<xi:include
 xmlns:xi="http://www.w3.org/2001/XInclude"
 href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml3/reference.I-D.draft-ietf-nfsv4-rpc-tls-11.xml"/>

<reference anchor="IBA">
<front>
<title>InfiniBand Architecture Specification Volume 1</title>
<seriesInfo name="Release" value="1.3"/>
<author>
<organization showOnFrontPage="true">InfiniBand Trade Association</organization>
</author>
<date month="March" year="2015"/>
</front>
<annotation>
Available from https://www.infinibandta.org/
</annotation>
</reference>

<reference anchor="CBFC">
<front>
<title>
Credit-Based Flow Control for ATM Networks: Credit Update Protocol, Adaptive Credit Allocation, and Statistical Multiplexing
</title>
<seriesInfo name="Proc." value="ACM SIGCOMM '94 Symposium on Communications Architectures, Protocols and Applications, pp. 101-114."/>
<author initials="H.T." surname="Kung">
<organization showOnFrontPage="true">Division of Applied Sciences, Harvard University</organization>
<address>
<postal>
<street>29 Oxford Street</street>
<city>Cambridge</city>
<region>MA</region>
<code>02138</code>
<country>United States of America</country>
</postal>
</address>
</author>
<author initials="T." surname="Blackwell">
<organization showOnFrontPage="true">Division of Applied Sciences, Harvard University</organization>
<address>
<postal>
<street>29 Oxford Street</street>
<city>Cambridge</city>
<region>MA</region>
<code>02138</code>
<country>United States of America</country>
</postal>
</address>
</author>
<author initials="A." surname="Chapman">
<organization showOnFrontPage="true">Bell-Northern Research</organization>
<address>
<postal>
<street>P.O. Box 3511, Station C</street>
<city>Ottawa</city>
<region>Ontario</region>
<code>K1Y 4H7</code>
<country>Canada</country>
</postal>
</address>
</author>
<date month="August" year="1994"/>
</front>
</reference>

<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.0768.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.0793.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.1094.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.1813.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3552.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5040.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5041.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5044.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5532.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5662.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5666.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7530.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7862.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8167.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8881.xml"/>

</references>

</references>

<section
 anchor="section_9e003b83-66b5-43d7-b9ef-0f271c8d301b"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>ULB Specifications</name>
<t>
Typically, an Upper-Layer Protocol (ULP) is defined
without regard to a particular RPC transport.
An Upper-Layer Binding (ULB) specification
provides guidance that helps a ULP interoperate
correctly and efficiently over a particular transport.
For RPC-over-RDMA version 2, a ULB may provide:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
A taxonomy of XDR data items that are eligible for DDP
</li>
<li>
Constraints on which upper-layer procedures
a sender may reduce,
and
on how many chunks may appear in a single RPC message
</li>
<li>
A method enabling a Requester to determine
the maximum size of the reply Payload
stream for all procedures in the ULP
</li>
<li>
An rpcbind port assignment for the RPC Program and Version
when operating on the particular transport
</li>
</ul>
<t>
Each RPC Program and Version tuple that operates
on RPC-over-RDMA version 2 needs to have a ULB specification.
</t>

<section
 anchor="section_2f2b32a4-d78a-45f0-b6a3-fa0e2d34a97b"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>DDP-Eligibility</name>
<t>
A ULB designates specific XDR data items as eligible for DDP.
As a sender constructs an RPC-over-RDMA message,
it can remove DDP-eligible data items from the Payload stream
so that the RDMA provider can place them
directly in the receiver's memory.
An XDR data item should be considered for DDP-eligibility
if there is a clear benefit to moving the contents
of the item directly from the sender's memory
to the receiver's memory.
</t>
<t>
Criteria for DDP-eligibility include:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The XDR data item is frequently sent or received, and its size is
often much larger than typical inline thresholds.
</li>
<li>
If the XDR data item is a result, its maximum size must be
predictable in advance by the Requester.
</li>
<li>
Transport-level processing of the XDR data item is not needed.
For example, the data item is an opaque byte array, which requires
no XDR encoding and decoding of its content.
</li>
<li>
The content of the XDR data item is sensitive to address alignment.
For example, a data copy operation would be required
on the receiver to enable the message to be parsed correctly,
or
to enable the data item to be accessed.
</li>
<li>
The XDR data item itself does not contain DDP-eligible data items.
</li>
</ul>
<t>
In addition to defining the set of data items that are DDP-eligible,
a ULB may limit the use of chunks
to particular upper-layer procedures.
If more than one data item in a procedure is DDP-eligible,
the ULB may limit the number of chunks
that a Requester can provide for a particular upper-layer procedure.
</t>
<t>
Senders never reduce data items that are not DDP-eligible.
Such data items can, however, be part of a Special Format payload.
</t>
<t>
The programming interface by which an upper-layer implementation
indicates the DDP-eligibility of a data item to the RPC transport
is not described by this specification.
The only requirements are
that the receiver can re-assemble
the transmitted RPC-over-RDMA message into a valid XDR stream
and
that DDP-eligibility rules specified by the ULB are respected.
</t>
<t>
There is no provision to express DDP-eligibility within the XDR
language.
The only definitive specification of DDP-eligibility is a ULB.
</t>
<t>
In general, a DDP-eligibility violation occurs when:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
A Requester reduces a non-DDP-eligible argument data item.
The Responder reports the violation as described in
<xref target="section_b1d23e5c-31df-483f-adb7-25430b5de38d" format="default" sectionFormat="of"/>.
</li>
<li>
A Responder reduces a non-DDP-eligible result data item.
The Requester terminates the pending RPC transaction
and reports an appropriate permanent error to the RPC consumer.
</li>
<li>
A Responder does not reduce a DDP-eligible result data item into
an available Write chunk.
The Requester terminates the pending RPC transaction
and reports an appropriate permanent error to the RPC consumer.
</li>
</ul>
</section>

<section
 anchor="section_a3f15fe3-2677-4d94-adb5-afa80c7e197a"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Maximum Reply Size</name>
<t>
When expecting small and moderately-sized Replies,
a Requester should rely on Message Continuation
rather than provisioning a Reply chunk.
For each ULP procedure where there is no clear Reply size maximum
and the maximum can be substantial,
the ULB should specify a dependable means
for determining the maximum Reply size.
</t>
</section>

<section
 anchor="section_a5e8ada2-5473-4321-beee-41c29e04227f"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Reverse-Direction Operation</name>
<t>
The direction of operation does not preclude
the need for DDP-eligibility statements.
</t>
<t>
Reverse-direction operation occurs
on an already-established connection.
Specification of RPC binding parameters is
usually not necessary in this case.
</t>
<t>
Other considerations may apply
when distinct RPC Programs
share an RPC-over-RDMA transport connection concurrently.
</t>
</section>

<section
 anchor="section_e568723e-3dd8-4a54-9401-92ae4f98f88f"
 numbered="true"
 removeInRFC="false"

 toc="default">
<name>Additional Considerations</name>
<t>
There may be other details provided in a ULB.
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
A ULB may recommend
inline threshold values
or
other transport-related parameters
for RPC-over-RDMA version 2 connections bearing that ULP.
</li>
<li>
A ULP may provide a means to communicate transport-related parameters
between peers.
</li>
<li>
Multiple ULPs may share
a single RPC-over-RDMA version 2 connection when
their ULBs allow the use of RPC-over-RDMA version 2
and
the rpcbind port assignments for those protocols permit connection sharing.
In this case, the same transport parameters (such as inline threshold)
apply to all ULPs using that connection.
</li>
</ul>
<t>
Each ULB needs to be designed to allow correct interoperation
without regard to the transport parameters actually in use.
Furthermore, implementations of ULPs must be designed
to interoperate correctly
regardless of the connection parameters in effect on a connection.
</t>
</section>

<section
 anchor="section_db58c83f-091c-481e-ba7c-a0246d1c475b"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>ULP Extensions</name>
<t>
An RPC Program and Version tuple may be extensible.
For instance,
the RPC version number may not reflect a ULP minor versioning scheme,
or
the ULP may allow the specification of additional features
after the publication of the original RPC Program specification.
ULBs are provided for interoperable RPC Programs and Versions by
extending existing ULBs to reflect the changes made necessary by each
addition to the existing XDR.
</t>
<t>
[ cel:
The final sentence is unclear, and may be inaccurate.
I believe I copied this section directly from RFC 8166.
Is there more to be said, now that we have some experience? ]
</t>
</section>

</section>

<section
 anchor="section_84e1ffc4-d916-4eb4-9fd8-a8218d084503"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Extending RPC-over-RDMA Version 2</name>
<t>
This Appendix is not addressed to protocol implementers,
but rather to authors of documents that
extend the protocol specified in the current document.
</t>
<t>
RPC-over-RDMA version 2 extensibility facilitates
limited extensions to the base protocol
presented in the current document
so that new optional capabilities can be introduced
without a protocol version change
while maintaining robust interoperability
with existing RPC-over-RDMA version 2 implementations.
It allows extensions to be defined,
including the definition of new protocol elements,
without requiring modification or recompilation
of the XDR for the base protocol.
</t>
<t>
Standards Track documents may introduce extensions
to the base RPC-over-RDMA version 2 protocol in two ways:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
They may introduce new <bcp14>OPTIONAL</bcp14> transport header types.
<xref target="section_d4650151-40f0-4e85-8755-02c38cf8f444" format="default" sectionFormat="of"/>
covers such transport header types.
</li>
<li>
They may define new <bcp14>OPTIONAL</bcp14> transport properties.
<xref target="section_a355adad-f03b-41a6-94a8-4128b10301bb" format="default" sectionFormat="of"/>
describes such transport properties.
</li>
</ul>
<t>
These documents may also add the following sorts
of ancillary protocol elements to the protocol
to support the addition of
new transport properties and header types:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
They may create new error codes, as described in
<xref target="section_c2f5e937-d612-4e3a-a380-e0b15261f6a0" format="default" sectionFormat="of"/>.
</li>
</ul>
<t>
New capabilities can be
proposed
and
developed
independently of each other.
Implementers can choose among them,
making it straightforward
to create
and
document
experimental features
and then bring them through the standards process.
</t>

<section
 anchor="section_1c2304a3-0dc6-4ca1-b710-77a7b07f3d19"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Documentation Requirements</name>
<t>
As described earlier,
a Standards Track document introduces a set of new protocol elements.
Together these elements are considered an <bcp14>OPTIONAL</bcp14> feature.
Each implementation is either
aware of all the protocol elements introduced by that feature
or
is aware of none of them.
</t>
<t>
Documents specifying extensions to RPC-over-RDMA version 2 should contain:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
An explanation of the purpose and use of each new protocol element.
</li>
<li>
An XDR description including all of the new protocol elements,
and a script to extract it.
</li>
<li>
A discussion of interactions with other extensions.
This discussion includes
requirements for other <bcp14>OPTIONAL</bcp14> features to be present,
or
that a particular level of support for an <bcp14>OPTIONAL</bcp14> facility is required.
</li>
</ul>
<t>
Implementers combine the XDR descriptions
of the new features they intend to use
with
the XDR description of the base protocol in the current document.
This combination is necessary to create a valid XDR input file
because extensions are
free to use XDR types defined in the base protocol,
and
later extensions may use types defined by earlier extensions.
</t>
<t>
The XDR description for the RPC-over-RDMA version 2 base protocol
combined with that for any selected extensions
should provide a human-readable and compilable definition
of the extended protocol.
</t>
</section>

<section
 anchor="section_d4650151-40f0-4e85-8755-02c38cf8f444"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Adding New Header Types to RPC-over-RDMA Version 2</name>
<t>
New transport header types are defined in a manner similar to Sections
<xref target="section_7e1de71a-9b68-4bd0-8213-97991139ab87" format="counter" sectionFormat="of"/>
through
<xref target="section_67bf730f-38f1-40bc-8eec-1bedde7b0449" format="counter" sectionFormat="of"/>.
In particular, what is needed is:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
A description of the function and use of the new header type.
</li>
<li>
A complete XDR description of the new header type.
</li>
<li>
A description of how receivers report errors,
including mechanisms for reporting errors
beyond the choices already available
in the base protocol or other extensions.
</li>
<li>
An indication of whether a Payload stream must be present,
and a description of its contents
and
how receivers use such Payload streams to reconstruct RPC messages.
</li>
<li>
As appropriate, a statement of
whether a Responder may use Remote Invalidation
when sending messages that contain the new header type.
</li>
</ul>
<t>
Additional documentation is necessary
because of the <bcp14>OPTIONAL</bcp14> status of new transport header types:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The document should discuss constraints on support for the new header types.
For example, if support for one header type is implied or foreclosed
by another one, this needs to be documented.
</li>
<li>
The document should describe the preferred method
by which a sender determines whether its peer
supports a particular header type.
It is always possible to send a test invocation
of a particular header type to see if support is available.
However, when more efficient means are available
(e.g., the value of a transport property),
this should be noted.
</li>
</ul>
</section>

<section
 anchor="section_a355adad-f03b-41a6-94a8-4128b10301bb"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Adding New Transport Properties to the Protocol</name>
<t>
A Standards Track document defining a new transport property
should include the following information
paralleling that provided in this document
for the transport properties defined herein:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The rpcrdma2_propid value identifying the new property.
</li>
<li>
The XDR typedef specifying the structure of its property value.
</li>
<li>
A description of the new property.
</li>
<li>
An explanation of how the receiver can use this information.
</li>
<li>
The default value if a peer never receives the new property.
</li>
</ul>
<t>
There is no requirement that propid assignments
occur in a continuous range of values.
Implementations should not rely on all such values being small integers.
</t>
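<t>
The following C fragment is a non-normative sketch of
a property lookup that does not depend on propid values
being small or contiguous.
It assumes that a propid fits in an unsigned 32-bit integer.
The type prop_handler, the function find_prop,
and the table entries and their values
are invented for illustration and are not assigned values.
</t>
<sourcecode type="c"><![CDATA[
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry; the values below are illustrative
 * only and do not correspond to assigned propid values. */
struct prop_handler {
    uint32_t    propid;
    const char *name;
};

static const struct prop_handler known_props[] = {
    { 1u,          "example-small-value" },
    { 0x80000001u, "example-large-value" },
};

/* Search by value instead of indexing an array by propid, so
 * that sparse or large propid values need no special handling.
 * Returns NULL when the propid is not in the table. */
static const struct prop_handler *find_prop(uint32_t propid)
{
    size_t n = sizeof(known_props) / sizeof(known_props[0]);

    for (size_t i = 0; i < n; i++) {
        if (known_props[i].propid == propid)
            return &known_props[i];
    }
    return NULL;
}
]]></sourcecode>
<t>
An implementation that supports many properties might prefer
a sorted table or a hash map;
the point of the sketch is only that the lookup is keyed by value.
</t>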
<t>
Before the defining Standards Track document is published,
the nfsv4 Working Group should select a unique propid value, and ensure that:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
rpcrdma2_propid values specified in the document do not conflict with
those currently assigned
or
in use by other pending working group documents defining transport properties.
</li>
<li>
<t>
rpcrdma2_propid values specified in the document do not conflict with
the range reserved for experimental use,
as defined in Section&nbsp;8.2.
</t>
<t>
[ cel:
 There is no longer a section 8.2 or an experimental range of propid values.
 Should we request the creation of an IANA registry for propid values? ].
</t>
</li>
</ul>
<t>
When a Standards Track document proposes additional transport properties,
reviewers should assess and address any security issues
exposed by those new transport properties.
</t>
</section>

<section
 anchor="section_c2f5e937-d612-4e3a-a380-e0b15261f6a0"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Adding New Error Codes to the Protocol</name>
<t>
The same Standards Track document that defines a new header type
may introduce new error codes used to support it.
A Standards Track document may similarly define
new error codes that an existing header type can return.
</t>
<t>
For error codes that do not require
the return of additional information,
a peer can use the existing RDMA_ERR2 header type to report the new error.
The sender sets rdma_err to the new error code,
which selects the default switch arm (i.e., void)
of the rpcrdma2_error union.
</t>
<t>
For error codes that do require
the return of related information together with the error,
a new header type should be defined that returns the error
together with the related information.
The sender of a new header type needs to be prepared
to accept header types necessary to report associated errors.
</t>
</section>

</section>

<section
 anchor="section_c2574344-5aec-427d-a5ed-048d7fcc0d95"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Differences from RPC-over-RDMA Version 1</name>
<t>
The primary goal of RPC-over-RDMA version 2 is
to relieve constraints that have
become evident in RPC-over-RDMA version 1
with deployment experience:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
RPC-over-RDMA version 1 has been challenging to update
to address shortcomings or improve data transfer efficiency.
</li>
<li>
The average size of NFSv4 COMPOUNDs is significantly greater than
that of NFSv3 requests, requiring the use of Long messages for frequent operations.
</li>
<li>
Reply size estimation is difficult more often than first expected.
</li>
</ul>
<t>
This section details specific changes in RPC-over-RDMA version 2
that address these constraints directly,
in addition to other changes to make implementation easier.
</t>

<section
 anchor="section_d945b9f0-0666-4db7-9126-be57cf7b5f4f"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Changes to the XDR Definition</name>
<t>
Several XDR structural changes enable
within-version protocol extensibility.
</t>
<t>
<xref target="RFC8166" format="default" sectionFormat="of"/>
defines the RPC-over-RDMA version 1 transport header
as a single XDR object,
with an RPC message potentially following it.
In RPC-over-RDMA version 2,
there are separate XDR definitions of the transport header prefix (see
<xref target="section_2d1735f0-c465-43c6-9c18-3da6b7979862" format="default" sectionFormat="of"/>),
which specifies the transport header type to be used,
and the transport header itself
(defined within one of the subsections of
<xref target="section_8039c7b8-9068-401e-9cbd-5c1e67d403e7" format="default" sectionFormat="of"/>).
This construction is similar to an RPC message,
which consists of an RPC header (defined in
<xref target="RFC5531" format="default" sectionFormat="of"/>)
followed by a message defined by an Upper-Layer Protocol.
</t>
<t>
As a new version of the RPC-over-RDMA transport protocol,
RPC-over-RDMA version 2 exists within the versioning rules defined in
<xref target="RFC8166" format="default" sectionFormat="of"/>.
In particular, it maintains the first four words of the protocol header,
as specified in
<xref target="RFC8166" section="4.2" format="default" sectionFormat="of"/>,
even though, as explained in
<xref target="section_e21d4f74-b536-47f2-9d07-c03a27a20de4" format="default" sectionFormat="of"/>
of the current document,
the XDR definition of those words is structured differently.
</t>
<t>
Although each of the first four fields retains its semantic function,
there are differences in interpretation:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The first word of the header, the rdma_xid field,
retains the format and function that it had in RPC-over-RDMA version 1.
Because RPC-over-RDMA version 2 messages can convey non-RPC messages,
a receiver should not use the contents of this field
without consideration of the protocol version and header type.
</li>
<li>
The second word of the header, the rdma_vers field,
retains the format and function that it had in RPC-over-RDMA version 1.
To clearly distinguish version 1 and version 2 messages,
senders need to fill in the correct version
(fixed after version negotiation).
Receivers should check that the content of the rdma_vers field is correct
before using the content of any other header field.
</li>
<li>
The third word of the header, the rdma_credit field,
retains the size and general purpose that it had in RPC-over-RDMA version 1.
However, RPC-over-RDMA version 2 divides this field
into two 16-bit subfields. See
<xref target="section_45c67eb8-8dc6-47c3-8555-14270f1514bF" format="default" sectionFormat="of"/>
for further details.
</li>
<li>
The fourth word of the header,
previously the union discriminator field rdma_proc,
retains its format and general function
even though the set of valid values has changed.
Within RPC-over-RDMA version 2,
this word is the rdma_htype field of the structure rdma_start.
The value of this field is now an unsigned 32-bit integer
rather than an enum type, to facilitate header type extension.
</li>
</ul>
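<t>
The layout described above can be illustrated
with the following non-normative sketch,
which decodes the first four XDR words of a transport header.
It assumes standard XDR encoding (each word is 32 bits, big-endian).
The names HeaderPrefix, parse_prefix, credit_hi, and credit_lo
are hypothetical.
The sketch deliberately does not assign a meaning
to either 16-bit half of rdma_credit.
</t>
<sourcecode type="python"><![CDATA[
```python
import struct
from typing import NamedTuple


class HeaderPrefix(NamedTuple):
    xid: int
    vers: int
    credit_hi: int  # upper 16 bits of rdma_credit
    credit_lo: int  # lower 16 bits of rdma_credit
    htype: int


def parse_prefix(buf: bytes, expected_vers: int = 2) -> HeaderPrefix:
    """Decode the first four XDR words of a transport header."""
    if len(buf) < 16:
        raise ValueError("header prefix requires 16 octets")
    # XDR encodes each 32-bit word in big-endian order.
    xid, vers, credit, htype = struct.unpack_from(">IIII", buf, 0)
    # Check rdma_vers before trusting any other field.
    if vers != expected_vers:
        raise ValueError("unexpected rdma_vers: %d" % vers)
    return HeaderPrefix(xid, vers, credit >> 16, credit & 0xFFFF, htype)
```
]]></sourcecode>
<t>
Treating rdma_htype as a plain unsigned integer,
rather than mapping it onto a closed enumeration,
lets a receiver carry an unrecognized header type far enough
to report it instead of failing during decoding.
</t>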
<t>
Beyond conforming to the restrictions specified in
<xref target="RFC8166" format="default" sectionFormat="of"/>,
RPC-over-RDMA version 2 attempts to limit
the scope of its changes in order to preserve interoperability.
Although it introduces the Call chunk
and splits the two version 1 workhorse procedure types
RDMA_MSG and RDMA_NOMSG into several variants,
RPC-over-RDMA version 2 otherwise
expresses chunks in the same format and utilizes them the same way.
</t>
</section>

<section
 anchor="section_630314a8-1cf5-40f7-a5ad-5bc12c719233"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Transport Properties</name>
<t>
RPC-over-RDMA version 2 provides a mechanism
for exchanging an implementation's operational properties.
The purpose of this exchange is to help endpoints
improve the efficiency of data transfer
by exploiting the characteristics of both peers
rather than falling back on
the lowest common denominator default settings.
A full discussion of transport properties appears in
<xref target="section_86248e99-ca60-478a-8aff-3fb387410077" format="default" sectionFormat="of"/>.
</t>
</section>

<section
 anchor="section_5a2e5ff8-0f0b-454d-b9b3-c6773cd77780"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Credit Management Changes</name>
<t>
RPC-over-RDMA transports employ credit-based flow control
to ensure that a Requester does not emit more RDMA Sends
than the Responder is prepared to receive.
</t>
<t>
<xref target="RFC8166" section="3.3.1" format="default" sectionFormat="of"/>
explains the operation of RPC-over-RDMA version 1 credit management
in detail.
In that design, each RDMA Send from a Requester
contains an RPC Call with a credit request,
and each RDMA Send from a Responder contains an RPC Reply
with a credit grant.
The credit grant implies that enough Receives
have been posted on the Responder to handle the credit grant
minus the number of pending RPC transactions
(the number of remaining Receive buffers might be zero).
</t>
<t>
Each RPC Reply acts as an implicit ACK
for a previous RPC Call from the Requester.
Without an RPC Reply message,
the Requester has no way to know that
the Responder is ready for subsequent RPC Calls.
</t>
<t>
Because version 1 embeds credit management in each message,
there is a strict one-to-one ratio between RDMA Send and RPC message.
There are interesting use cases that might be enabled
if this relationship were more flexible:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
RPC-over-RDMA operations that do not carry an RPC message,
e.g., control plane operations.
</li>
<li>
A single RDMA Send that conveys more than one RPC message,
e.g., for interrupt mitigation.
</li>
<li>
An RPC message that requires several sequential RDMA Sends,
e.g., to reduce the use of explicit RDMA operations for moderate-sized RPC messages.
</li>
<li>
An RPC transaction that requires multiple exchanges
or
an odd number of RPC-over-RDMA operations to complete.
</li>
</ul>
<t>
RPC-over-RDMA version 2 provides
a more sophisticated credit accounting mechanism
to address these shortcomings.
<xref target="section_45c67eb8-8dc6-47c3-8555-14270f1514bF" format="default" sectionFormat="of"/>
explains the new mechanism in detail.
</t>
</section>

<section
 anchor="section_f7fb5108-58ea-4718-84be-5119a302f5f5"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Inline Threshold Changes</name>
<t>
An "inline threshold" value is the largest message size (in octets)
that can be conveyed on an RDMA connection
using only RDMA Send and Receive.
Each connection has two inline threshold values:
one for messages flowing from client to server
(referred to as the "client-to-server inline threshold")
and one for messages flowing from server to client
(referred to as the "server-to-client inline threshold").
</t>
<t>
A connection's inline thresholds determine,
among other things,
when RDMA Read or Write operations are required
because an RPC message cannot be conveyed via a single RDMA Send and Receive pair.
When an RPC message does not contain DDP-eligible data items,
a Requester can prepare a Special Format Call or Reply
to convey the whole RPC message using RDMA Read or Write operations.
</t>
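<t>
The role of an inline threshold can be sketched,
non-normatively, as a single comparison.
The function name needs_explicit_rdma is hypothetical,
and the sketch assumes that message_len counts every octet
a sender would place in one RDMA Send, transport header included.
</t>
<sourcecode type="python"><![CDATA[
```python
def needs_explicit_rdma(message_len: int, inline_threshold: int) -> bool:
    """Return True when a message cannot be sent with one RDMA Send."""
    if message_len < 0 or inline_threshold <= 0:
        raise ValueError("invalid message length or inline threshold")
    # A message that fits within the threshold travels inline;
    # anything larger needs RDMA Read or Write operations.
    return message_len > inline_threshold
```
]]></sourcecode>
<t>
For example, with a 1024-octet threshold,
a 500-octet message is sent inline,
while a 1500-octet message is not.
</t>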
<t>
RDMA Read and Write operations require that data payloads
reside in memory registered with the local RNIC.
When an RPC completes,
that memory is invalidated to fence it from the Responder.
For large payloads, memory registration and invalidation
typically have a latency cost
that is insignificant compared to data handling costs.
</t>
<t>
When a data payload is small, however,
the cost of registering and invalidating
memory where the payload resides
becomes a significant part of total RPC latency.
Therefore, the most efficient operation of an RPC-over-RDMA transport
occurs when the peers use explicit RDMA Read and Write operations
for large payloads but avoid those operations for small payloads.
</t>
<t>
When the authors of
<xref target="RFC8166" format="default" sectionFormat="of"/>
first conceived RPC-over-RDMA version 1,
the average size of RPC messages
that did not involve a significant data payload
was under 500 bytes.
A 1024-byte inline threshold adequately minimized
the frequency of inefficient Long messages.
</t>
<t>
With NFS version 4
<xref target="RFC7530" format="default" sectionFormat="of"/>,
the increased size of NFS COMPOUND operations
resulted in RPC messages that are, on average,
larger than those of previous versions of NFS.
With a 1024-byte inline threshold,
frequent operations such as GETATTR and LOOKUP
require RDMA Read or Write operations,
reducing the efficiency of data transport.
</t>
<t>
To reduce the frequency of Special Format messages,
RPC-over-RDMA version 2 increases the default size of inline thresholds.
This change also increases the maximum size
of reverse-direction RPC messages.
</t>
</section>

<section
 anchor="section_1f3a1439-702f-4309-8733-5fa0e20555f4"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Message Continuation Changes</name>
<t>
In addition to a larger default inline threshold,
RPC-over-RDMA version 2 introduces Message Continuation.
Message Continuation is a mechanism
that enables the transmission of a data payload
using more than one RDMA Send.
The purpose of Message Continuation is
to provide relief in several essential cases:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
If a Requester finds that it is inefficient
to convey a moderately-sized data payload using Read chunks,
the Requester can use Message Continuation to send the RPC Call.
</li>
<li>
If a Requester has provided insufficient Reply chunk space
for a Responder to send an RPC Reply,
the Responder can use Message Continuation to send the RPC Reply.
</li>
<li>
If a sender has to convey a sizeable non-RPC data payload
(e.g., a large transport property),
the sender can use Message Continuation to avoid having to register memory.
</li>
</ul>
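<t>
The basic idea of Message Continuation can be sketched,
non-normatively, as splitting one payload across several Sends.
The function name split_for_continuation and the parameter max_per_send
are hypothetical.
The sketch assumes max_per_send is the number of payload octets
available in each Send,
and it does not model the transport header carried by each Send
or the way the final Send in a sequence is marked.
</t>
<sourcecode type="python"><![CDATA[
```python
from typing import List


def split_for_continuation(payload: bytes, max_per_send: int) -> List[bytes]:
    """Split a payload into pieces that each fit in one RDMA Send."""
    if max_per_send <= 0:
        raise ValueError("max_per_send must be positive")
    pieces = [payload[i:i + max_per_send]
              for i in range(0, len(payload), max_per_send)]
    # An empty payload still occupies one (empty) Send in this sketch.
    return pieces or [b""]
```
]]></sourcecode>
<t>
A receiver reverses the operation
by concatenating the pieces in the order they arrive,
so no memory registration is needed at either end.
</t>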
</section>

<section
 anchor="section_16f03208-32cb-451d-90ab-a6f5f4b9e9b0"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Host Authentication Changes</name>
<t>
For the general operation of NFS on open networks,
we eventually intend to rely on RPC-on-TLS
<xref target="I-D.ietf-nfsv4-rpc-tls" format="default" sectionFormat="of"/>
to provide cryptographic authentication of the two ends of each connection.
In turn, this can improve the trustworthiness of AUTH_SYS-style user identities
that flow on TCP, which are not cryptographically protected.
We do not have a similar solution for RPC-over-RDMA, however.
</t>
<t>
Here, the RDMA transport layer already provides
a strong guarantee of message integrity.
On some network fabrics, IPsec or TLS
can protect the privacy of in-transit data.
However, this is not the case for all fabrics
(e.g., InfiniBand
<xref target="IBA" format="default" sectionFormat="of"/>).
</t>
<t>
Thus, RPC-over-RDMA version 2 introduces a mechanism
for authenticating connection peers (see
<xref target="section_5f63e1b6-8d24-453b-b18b-b98ad66f3671" format="default" sectionFormat="of"/>).
As with GSS channel binding,
there is also a way to determine when host authentication is unnecessary.
</t>
</section>

<section
 anchor="section_57c034d6-7129-4f7b-b8df-31e8bc691964"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Support for Remote Invalidation</name>
<t>
When an RDMA consumer uses FRWR or Memory Windows to register memory,
that memory may be invalidated remotely
<xref target="RFC5040" format="default" sectionFormat="of"/>.
These mechanisms are available when
a Requester's RNIC supports MEM_MGT_EXTENSIONS.
</t>
<t>
For this discussion, there are two classes of STags.
Dynamically-registered STags appear in a single RPC,
then are invalidated.
Persistently-registered STags survive longer than one RPC.
They may persist for the life of an RPC-over-RDMA connection
or even longer.
</t>
<t>
An RPC-over-RDMA Requester can provide
more than one STag in a transport header.
It may provide
a combination of dynamically- and persistently-registered STags in one RPC message,
or
any combination of these in a series of RPCs on the same connection.
Only dynamically-registered STags using Memory Windows or FRWR
may be invalidated remotely.
</t>
<t>
There is no transport-level mechanism
by which a Responder can determine
how a Requester-provided STag was registered,
nor whether it is eligible to be invalidated remotely.
A Requester that mixes persistently- and dynamically-registered STags in one RPC,
or mixes them across RPCs on the same connection,
must, therefore, indicate which STag the Responder may invalidate remotely
via a mechanism provided in the Upper-Layer Protocol.
RPC-over-RDMA version 2 provides such a mechanism.
</t>
<t>
A sender uses the RDMA Send With Invalidate operation
to invalidate an STag on the remote peer.
It is available only when both peers support MEM_MGT_EXTENSIONS
(can send and process an IETH).
</t>
<t>
Existing RPC-over-RDMA transport protocol specifications
<xref target="RFC8166" format="default" sectionFormat="of"/>
<xref target="RFC8167" format="default" sectionFormat="of"/>
do not forbid direct data placement in the reverse direction.
However, there is currently no Upper-Layer Protocol
that makes data items in reverse-direction operations
eligible for direct data placement.
</t>
<t>
When chunks are present in a reverse-direction RPC request,
Remote Invalidation enables the Responder
to trigger invalidation of a Requester's STags
as part of sending an RPC Reply,
the same way as is done in the forward direction.
</t>
<t>
However, in the reverse direction, the server acts as the Requester,
and the client is the Responder.
The server's RNIC, therefore, must support receiving an IETH,
and the server must have registered its STags
with an appropriate registration mechanism.
</t>
</section>

<section
 anchor="section_e936ff03-84d1-489c-9c5f-5541adbabc94"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Integration of Reverse-Direction Operation</name>
<t>
Because
<xref target="RFC5666" format="default" sectionFormat="of"/>
did not include specification of reverse-direction operation,
<xref target="RFC8166" format="default" sectionFormat="of"/>
does not include it either.
Reverse-direction operation in RPC-over-RDMA version 1 is
specified by a separate Standards Track document
<xref target="RFC8167" format="default" sectionFormat="of"/>.
</t>
<t>
Reverse-direction operation in RPC-over-RDMA version 1
was constrained by the limited ability to extend
that version of the protocol.
The most awkward issue is that
a receiver needs to peek at each ingress RPC message payload
to determine whether it is a Call or a Reply message.
This is necessary because the meaning of several fields
in the RPC-over-RDMA transport header
is determined by the direction of the RPC message payload:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The meaning of the value in the rdma_xid field
is determined by the direction of the message because
the XID spaces in the forward and reverse directions are distinct.
</li>
<li>
The meaning of the value in the rdma_credit field
is determined by the direction of the message because
credits are granted separately
for forward and reverse direction operation.
</li>
<li>
The purpose of Write chunks and the meaning of their length fields
are determined by the direction of the message because
in Call messages, they are provisional,
but in Reply messages, they represent returned results.
</li>
</ul>
<t>
The current document remedies this awkwardness by
integrating reverse-direction operation
into RPC-over-RDMA version 2 so that it can
make use of all facilities that are available in the forward direction,
including body chunks, remote invalidation, and message continuation.
To enable this integration,
the direction of the RPC message payload is encoded
in each RPC-over-RDMA version 2 transport header.
</t>
</section>

<section
 anchor="section_e554df42-6e82-4e28-96a0-8f9872eb476c"
 numbered="true"
 removeInRFC="false"
 toc="default">
<name>Error Reporting Changes</name>
<t>
RPC-over-RDMA version 2 expands the repertoire of errors
that connection peers may report to each other.
The goals of this expansion are:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
To fill in details of peer recovery actions.
</li>
<li>
To enable retrying after certain failures
caused by mis-estimation of the maximum reply size.
</li>
<li>
To minimize the likelihood of a Requester waiting forever for a Reply
when there are communications problems that prevent the Responder
from sending it.
</li>
</ul>
</section>


<section
 anchor="section_9180da71-3ea9-40f8-b6d2-ded634e4ce25"
 numbered="true"
 removeInRFC="false"
 toc="include">
<name>Changes in Terminology</name>
<t>
The RPC-over-RDMA version 2 specification makes
the following changes in terminology.
These changes do not result in
changes in the behavior or operation of the protocol.
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The current document explicitly acknowledges the different semantics
and purpose of Write chunks appearing in Call messages
and those appearing in Reply messages.
</li>
<li>
The current document introduces the term "payload format"
to describe the selection of a mechanism for reducing
and conveying an RPC message payload.
It replaces the terms "short message" and "long message"
with the terms "simple format" and "special format"
because this selection is not based only on
the size of the payload.
</li>
<li>
The current document introduces the terms
"data item chunk"
and
"body chunk"
in order to distinguish the purpose and operation
of these two categories of chunk.
</li>
<li>
For improved readability,
the current document replaces the terms
"RDMA segment"
and
"plain segment"
with the term "segment",
and the term
"RDMA read segment"
with the term "Read segment".
</li>
<li>
The current document refers specifically to
the RDMAP, DDP, and MPA Standards Track protocols
rather than using the nebulous term "iWARP".
</li>
</ul>
</section>

</section>

<section
 anchor="section_7b212a81-9c2a-4c05-891a-369cc7184585"
 numbered="false"
 removeInRFC="false"
 toc="default">
<name>Acknowledgments</name>
<t>
The authors gratefully acknowledge the work of
<contact fullname="Brent Callaghan"/>
and
<contact fullname="Tom Talpey"/>
on the original RPC-over-RDMA version 1 specification
<xref target="RFC5666" format="default" sectionFormat="of"/>.
</t>
<t>
We are deeply indebted to
<contact fullname="Jana Igeyar"/>
for contributing the RPC-over-RDMA version 2
flow control mechanism described in
<xref target="section_45c67eb8-8dc6-47c3-8555-14270f1514bF" format="default" sectionFormat="of"/>.
</t>
<t>
The authors also wish to thank
<contact fullname="Bill Baker"/>,
<contact fullname="Greg Marsden"/>,
and
<contact fullname="Matt Benjamin"/>
for their support of this work.
</t>
<t>
The XDR extraction conventions were
first described by the authors of the NFS version 4.1
XDR specification
<xref target="RFC5662" format="default" sectionFormat="of"/>.
<contact fullname="Herbert van den Bergh"/>
suggested the replacement sed script used in this document.
</t>
<t>
Special thanks go to
Transport Area Director
<contact fullname="Magnus Westerlund"/>,
NFSV4 Working Group Chairs
<contact fullname="Spencer Shepler"/>,
and
<contact fullname="Brian Pawlowski"/>,
and
NFSV4 Working Group Secretary
<contact fullname="Thomas Haynes"/>
for their support.
</t>
</section>

</back>

</rfc>
