<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>

<!-- Copyright (C) The IETF Trust (2014) -->
<!-- Copyright (C) The Internet Society (2014) -->

<!-- XML source for the pNFS SCSI Layout for NVMe Internet-Draft -->

<!-- To generate text with the xml2rfc tool tclsh8.3 xml2rfc.tcl 
     xml2rfc this_file.xml that_file.txt which puts the formatted 
     text into that_file.txt -->

<!-- processing instructions (for a complete list and description,
     see file http://xml.resource.org/authoring/README.html -->

<!-- try to enforce the ID-nits conventions and DTD validity -->

<?rfc strict="yes" ?>

<!-- items used when reviewing the document -->

<?rfc comments="yes" ?>  <!-- controls display of <cref> elements -->
<?rfc inline="yes" ?>    <!-- when no, put comments at end in comments section,
                                otherwise, put inline -->
<?rfc editing="no" ?>   <!-- when yes, insert editing marks -->

<!-- create table of contents (set it options).  
     Note the table of contents may be omitted
     for very short documents --> 

<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>

<!-- choose the options for the references. Some like
     symbolic tags in the references (and citations)
     and others prefer numbers. --> 

<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>

<!-- these two save paper: start new paragraphs from the same page etc. -->

<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>

<!-- end of list of processing instructions -->

<rfc
    category="info"
    ipr="trust200902"
    docName="draft-hellwig-nfsv4-scsi-layout-nvme-03" >


<front>
    <title abbrev="pNFS SCSI Layout for NVMe">
      Using the Parallel NFS (pNFS) SCSI Layout with NVMe
    </title>

    <author fullname="Christoph Hellwig"
            initials="C."
            surname="Hellwig">
      <address>
        <email>hch@lst.de</email>
      </address>
    </author>

    <author fullname="Charles Lever"
	    initials="C."
	    surname="Lever">
      <organization abbrev="Oracle">Oracle Corporation</organization>
      <address>
	<postal>
	  <street/>
	  <city/>
	  <region/>
	  <code/>
	  <country>United States of America</country>
        </postal>
        <email>chuck.lever@oracle.com</email>
      </address>
    </author>

    <author fullname="Sorin Faibish"
	    initials="S."
	    surname="Faibish">
      <organization>Cirrus Data Solutions Inc.</organization>
      <address>
	<postal>
	  <street>11 Selwyn Road</street>
	  <city>Newton</city>
	  <region>MA</region>
	  <code>02461</code>
	  <country>United States of America</country>
        </postal>
        <email>sorin.faibish@cdsi.us.com</email>
      </address>
    </author>

    <author fullname="David L. Black"
	    initials="D."
	    surname="Black">
      <organization>Dell Technologies</organization>
      <address>
	<postal>
	  <street>176 South Street</street>
	  <city>Hopkinton</city>
	  <region>MA</region>
	  <code>01748</code>
	  <country>United States of America</country>
        </postal>
        <email>david.black@dell.com</email>
      </address>
    </author>

    <date year="2022" month="July" day="07"/>

    <area>Transport</area>
    <workgroup>NFSv4</workgroup>
    <keyword>NFSv4</keyword>

    <abstract>
      <t>
        This document explains how to use the Parallel Network File System
	(pNFS) SCSI Layout Type with transports using the NVMe or NVMe
	over Fabrics protocol.
      </t>
    </abstract>
</front>
<middle>

<section anchor="sec:intro" title="Introduction">
  <t>
    The pNFS Small Computer System Interface (SCSI) layout
    <xref target="RFC8154" /> is a layout type
    that allows NFS clients to directly perform I/O to block storage devices
    while bypassing the Metadata Server (MDS).  It is specified by using
    concepts from the SCSI protocol family for the data path to the storage
    devices.
  </t>
  <t>
    This document explains how to access NVM Command Set Namespaces
    <xref target="NVME-NVM" /> exported by NVMe Controllers implementing
    the NVMe Base specification (<xref target="NVME-BASE" />) using
    the SCSI layout type.

    This document is independent of the underlying transport used by
    the NVMe Controller and thus supports Controllers implementing a
    wide variety of transports, including PCI Express, RDMA, TCP, and
    Fibre Channel.
  </t>
  <t>
    This document does not amend the pNFS SCSI layout document, but
    instead explains how to map the SCSI constructs used in
    the pNFS SCSI layout document to NVMe concepts.
  </t>

  <section anchor="ssc:intro:reqlang" title="Requirements Language">
    <t>
      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
      "OPTIONAL" in this document are to be interpreted as described in
      BCP 14 <xref target="RFC2119" /> <xref target="RFC8174" /> when, and
      only when, they appear in all capitals, as shown here.
    </t>
  </section>

  <section anchor="ssc:intro:defs" title="General Definitions">
    <t>
      The following definitions are included to provide context for the
      reader.
    </t>

    <t>
      <list style='hanging'>
        <t hangText="Client">
          The "client" is the entity that accesses the NFS server's
          resources.  The client may be an application that contains the
          logic to access the NFS server directly.  The client may also be
          the traditional operating system client that provides remote file
          system services for a set of applications.
        </t>

        <t hangText="Server">
          The "server" is the entity responsible for coordinating client
          access to a set of file systems and is identified by a server
          owner.
        </t>
      </list>
    </t>
  </section>
</section>

<section anchor='sec:slm' title='SCSI Layout mapping to NVMe'>
  <t>
    The SCSI layout definition <xref target="RFC8154" /> references only
    a few SCSI-specific concepts directly.  This document provides a mapping
    from these SCSI concepts to NVM Express concepts that SHOULD be used
    when using the pNFS SCSI layout with NVMe namespaces.
  </t>

  <section anchor='ssc:volident' title='Volume Identification'>
    <t>
      The pNFS SCSI layout uses the Device Identification VPD page (page code
      0x83) from <xref target="SPC5" /> to identify the devices used by
      a layout. Implementations that use NVMe namespaces as storage devices
      map NVMe namespace identifiers to a subset of the identifiers
      that the Device Identification VPD page supports for SCSI logical
      units.
    </t>

    <t>
      To be used as storage devices for the pNFS SCSI layout, NVMe namespaces 
      MUST support either the EUI64 or NGUID value reported in a Namespace
      Identification Descriptor, the I/O Command Set Independent Identify
      Namespace Data Structure, and the Identify Namespace Data Structure,
      NVM Command Set. If available, the NGUID value SHOULD be used as it is
      the larger identifier.
    </t>

    <t>
      Methods based on the Serial Number are not suitable for unique
      addressing needs and thus MUST NOT be used.
    </t>

    <t>
      The pnfs_scsi_base_volume_info4 structure for an NVMe namespace
      SHALL be constructed as follows:
      <list style='numbers'>
      <t>The "sbv_code_set" field SHALL be set to PS_CODE_SET_BINARY.</t>
      <t>The "sbv_designator_type" field SHALL be set to
         PS_DESIGNATOR_EUI64.</t>
      <t>The "sbv_designator" field SHALL contain either the NGUID or
	 the EUI64 identifier for the namespace.  If both NGUID and EUI64
	 identifiers are available, then the NGUID identifier SHOULD be
	 used as it is the larger identifier.</t>
      <!-- XXX: add a reference to the persistent reservation section for
           sbv_pr_key -->
      </list>
   
      RFC 8154 specifies the "sbv_designator" field as an XDR variable-length
      opaque&lt;&gt;. The length of that XDR opaque&lt;&gt; value (part of
      its XDR representation) indicates which NVMe identifier is present.
      That length MUST be 16 octets for an NVMe NGUID identifier and
      MUST be 8 octets for an NVMe EUI64 identifier.  All other lengths
      MUST NOT be used with an NVMe namespace.
    </t>
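    <t>
      The construction rules above can be sketched in Python as follows.
      This is an illustration only, not a normative encoding: the function
      names choose_designator and encode_base_volume_info are hypothetical,
      the numeric enum values are assumed to match the XDR description in
      RFC 8154 and should be checked against it, and treating an all-zero
      NGUID or EUI64 as "not available" is an assumption about how the
      namespace reports an unsupported identifier.
    </t>
    <figure>
      <artwork><![CDATA[
```python
import struct

PS_CODE_SET_BINARY = 1    # assumed value, check RFC 8154 XDR
PS_DESIGNATOR_EUI64 = 2   # assumed value, check RFC 8154 XDR
EUI64_LEN = 8
NGUID_LEN = 16


def choose_designator(nguid, eui64):
    """Prefer the NGUID, else fall back to the EUI64.

    Assumption: None or all-zero bytes means "not available".
    """
    if nguid is not None and any(nguid):
        if len(nguid) != NGUID_LEN:
            raise ValueError("NGUID must be 16 octets")
        return nguid
    if eui64 is not None and any(eui64):
        if len(eui64) != EUI64_LEN:
            raise ValueError("EUI64 must be 8 octets")
        return eui64
    raise ValueError("namespace has neither NGUID nor EUI64")


def encode_base_volume_info(designator, pr_key):
    """XDR-encode the base volume info for an NVMe namespace."""
    if len(designator) not in (EUI64_LEN, NGUID_LEN):
        raise ValueError("designator must be 8 or 16 octets")
    return b"".join([
        struct.pack(">I", PS_CODE_SET_BINARY),   # code set
        struct.pack(">I", PS_DESIGNATOR_EUI64),  # designator type
        struct.pack(">I", len(designator)),      # opaque<> length
        designator,            # 8 or 16 octets, so no XDR padding
        struct.pack(">Q", pr_key),               # sbv_pr_key
    ])
```
]]></artwork>
    </figure>
    <t>
      In this sketch, the receiver can tell the two identifier kinds apart
      purely from the encoded opaque length (8 or 16), which is why any
      other length is rejected before encoding.
    </t>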
  </section>

  <section anchor='ssc:fencing' title='Client Fencing'>
    <t>
      The SCSI layout uses Persistent Reservations (PRs) to provide client
      fencing.  For this, both the MDS and the clients have to register
      a key with the storage device, and the MDS has to create a
      reservation on the storage device.
    </t>
    <t>
      The following is a full mapping of the required PERSISTENT RESERVE IN
      and PERSISTENT RESERVE OUT SCSI commands <xref target="SPC5" /> to
      NVMe commands which MUST be used when using NVMe namespaces as
      storage devices for the pNFS SCSI layout.
    </t>

    <section anchor='ssc:fencing:keys' title='PRs - Key Registration'>
      <t>
        On NVMe namespaces, reservation keys are registered using the
	Reservation Register command (refer to Section 7.3 of
	<xref target="NVME-BASE" />) with the Reservation Register Action
	(RREGA) field set to 000b (i.e., Register Reservation Key) and
	supplying the reservation key in the New Reservation Key (NRKEY)
	field.
      </t>
      <t>
        Reservation keys are unregistered using the Reservation Register
	command with the Reservation Register Action (RREGA) field set to
	001b (i.e., Unregister Reservation Key) and supplying the reservation
	key in the Current Reservation Key (CRKEY) field.
      </t>
      <t>
	One important difference between SCSI Persistent Reservations
	and NVMe Reservations is that NVMe reservation keys always apply
	to all controllers used by a host (as indicated by the NVMe Host
	Identifier). This behavior is analogous to setting the ALL_TG_PT
	bit when registering a SCSI Reservation key, and is always supported
	by NVMe Reservations, unlike the ALL_TG_PT bit, for which SCSI
	support is inconsistent and cannot be relied upon.
      </t>
      <t>
	Registering a reservation key with a namespace creates an
	association between a host and a namespace. A host that is a
	registrant of a namespace may use any controller with which that
	host is associated (i.e., that has the same Host Identifier,
	refer to Section 5.27.1.25 of <xref target="NVME-BASE" />)
	to access that namespace as a registrant.
      </t>
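      <t>
        As an illustration, the two Reservation Register uses described
        above could be modeled as shown below. This is a hedged sketch,
        not a driver: the function names are hypothetical, and the
        placement of RREGA in bits 2:0 of Command Dword 10 and of CRKEY
        and NRKEY as consecutive little-endian 64-bit values in the data
        buffer is an assumption that should be verified against
        <xref target="NVME-BASE" />.
      </t>
      <figure>
        <artwork><![CDATA[
```python
import struct

RREGA_REGISTER = 0b000     # Register Reservation Key
RREGA_UNREGISTER = 0b001   # Unregister Reservation Key


def reservation_register(rrega, crkey=0, nrkey=0):
    """Return (cdw10, data) for a Reservation Register command.

    Assumed layout: RREGA in CDW10 bits 2:0; data buffer holds
    CRKEY then NRKEY, each a little-endian 64-bit value.
    """
    cdw10 = rrega & 0x7
    data = struct.pack("<QQ", crkey, nrkey)
    return cdw10, data


def register_key(key):
    """Register: the key is supplied in NRKEY."""
    return reservation_register(RREGA_REGISTER, nrkey=key)


def unregister_key(key):
    """Unregister: the key is supplied in CRKEY."""
    return reservation_register(RREGA_UNREGISTER, crkey=key)
```
]]></artwork>
      </figure>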
    </section>

    <section anchor='ssc:fencing:reg'
	     title='PRs - MDS Registration and Reservation'>
      <t>
	Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the MDS
	needs to prepare the volume for fencing using PRs. This is done by
	registering the reservation key generated for the MDS with the device (see
	<xref target="ssc:fencing:keys" />) followed by a Reservation Acquire
	command (refer to Section 7.2 of <xref target="NVME-BASE" />) with
	the Reservation Acquire Action (RACQA) field set to 000b (i.e., Acquire)
	and the Reservation Type (RTYPE) field set to 4h (i.e., Exclusive Access
	- Registrants Only Reservation).
      </t>
    </section>

    <section anchor='ssc:fenceaction' title='Fencing Action'>
      <t>
	In case of a non-responding client, the MDS fences the client by
	executing a Reservation Acquire command (refer to Section 7.2 of
	<xref target="NVME-BASE" />), with the Reservation Acquire Action
	(RACQA) field set to 001b (i.e., Preempt) or 010b (i.e., Preempt and
	Abort), the Current Reservation Key (CRKEY) field set to the
	server's reservation key, the Preempt Reservation Key (PRKEY) field
	set to the reservation key associated with the non-responding client,
	and the Reservation Type (RTYPE) field set to 4h (i.e., Exclusive
	Access - Registrants Only Reservation).
      </t>
      <t>
	The client can distinguish I/O errors due to fencing from other
	errors based on the Reservation Conflict NVMe status code.
      </t>
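      <t>
	The Reservation Acquire usage in this section and in
	<xref target="ssc:fencing:reg" /> can be sketched as follows.
	The helper names are hypothetical, and the field placement (RACQA
	in bits 2:0 and RTYPE in bits 15:8 of Command Dword 10, with CRKEY
	and PRKEY as consecutive little-endian 64-bit values in the data
	buffer) is an assumption to be checked against
	<xref target="NVME-BASE" />.
      </t>
      <figure>
        <artwork><![CDATA[
```python
import struct

RACQA_ACQUIRE = 0b000
RACQA_PREEMPT = 0b001
RACQA_PREEMPT_AND_ABORT = 0b010
RTYPE_EXCL_ACCESS_REG_ONLY = 0x4  # Exclusive Access - Registrants Only


def reservation_acquire(racqa, crkey, prkey=0,
                        rtype=RTYPE_EXCL_ACCESS_REG_ONLY):
    """Return (cdw10, data) for a Reservation Acquire command.

    Assumed layout: RACQA in CDW10 bits 2:0, RTYPE in bits 15:8;
    data buffer holds CRKEY then PRKEY, little-endian 64-bit each.
    """
    cdw10 = (racqa & 0x7) | ((rtype & 0xFF) << 8)
    data = struct.pack("<QQ", crkey, prkey)
    return cdw10, data


def mds_reserve(mds_key):
    """MDS creates the reservation after registering its key."""
    return reservation_acquire(RACQA_ACQUIRE, mds_key)


def mds_fence(mds_key, client_key, abort=False):
    """MDS preempts the key of a non-responding client."""
    racqa = RACQA_PREEMPT_AND_ABORT if abort else RACQA_PREEMPT
    return reservation_acquire(racqa, mds_key, client_key)
```
]]></artwork>
      </figure>
      <t>
	In this sketch, the only differences between creating the
	reservation and fencing a client are the action code and the
	presence of the client's key in PRKEY; the reservation type stays
	at 4h in both cases.
      </t>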
    </section>

    <section anchor='ssc:recovery' title='Client Recovery after a Fence Action'>
      <t>
        If an NVMe command issued by the client to the storage device returns
	a non-retryable error (refer to the DNR bit defined in Figure 92 in
	<xref target="NVME-BASE" />), the client MUST commit all layouts that
	use the storage device through the MDS, return all outstanding layouts
	for the device, forget the device ID, and unregister the reservation
	key.
      </t>
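      <t>
        The client-side handling described above and in
        <xref target="ssc:fenceaction" /> might be sketched as follows.
        This is illustrative only: the function names are hypothetical,
        and the status layout used here (a 15-bit status value with the
        phase tag already removed, Status Code in bits 7:0, Status Code
        Type in bits 10:8, DNR in bit 14, and 83h as the Reservation
        Conflict code) is an assumption that should be verified against
        <xref target="NVME-BASE" />.
      </t>
      <figure>
        <artwork><![CDATA[
```python
SC_RESERVATION_CONFLICT = 0x83  # assumed Generic Command Status code
SCT_GENERIC = 0x0
DNR_BIT = 1 << 14               # assumed Do Not Retry bit position


def decode_status(status):
    """Split a 15-bit status value (phase tag already removed)."""
    return {
        "sc": status & 0xFF,
        "sct": (status >> 8) & 0x7,
        "dnr": bool(status & DNR_BIT),
    }


def is_fencing_error(status):
    """True if the command failed with Reservation Conflict."""
    st = decode_status(status)
    return (st["sct"] == SCT_GENERIC
            and st["sc"] == SC_RESERVATION_CONFLICT)


def recovery_steps(status):
    """Ordered client actions after a non-retryable error."""
    if not decode_status(status)["dnr"]:
        return []  # retryable: the text requires no recovery
    return [
        "LAYOUTCOMMIT all layouts using the device",
        "LAYOUTRETURN all outstanding layouts for the device",
        "forget the device ID",
        "unregister the reservation key",
    ]
```
]]></artwork>
      </figure>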
    </section>
  </section>

  <section anchor='ssc:caches' title='Volatile write caches'>
    <t>
      For NVMe controllers, a volatile write cache is enabled if bit 0 of the
      Volatile Write Cache (VWC) field in the Identify Controller Data
      Structure, I/O Command Set Independent (see Figure 275 in
      <xref target="NVME-BASE" />) is set and
      the Volatile Write Cache Enable (WCE) bit (i.e., bit 0) in
      the Volatile Write Cache Feature (Feature Identifier 06h)
      (see Section 5.27.1.4 of <xref target="NVME-BASE" />) is set.
    </t>
    <t>
      If a volatile write cache is enabled on an NVMe namespace used as a
      storage device for the pNFS SCSI layout, the pNFS server (MDS) MUST
      use the NVMe FLUSH command to flush the volatile write cache to
      stable storage before the LAYOUTCOMMIT operation returns.
    </t>
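    <t>
      The check and the required ordering can be sketched as follows.
      This is a minimal illustration under stated assumptions: the
      function names, the dictionary keys, and the flush and
      commit_metadata callbacks are all hypothetical stand-ins for
      whatever a real MDS uses to issue an NVMe FLUSH and to complete
      LAYOUTCOMMIT processing.
    </t>
    <figure>
      <artwork><![CDATA[
```python
def volatile_write_cache_enabled(vwc, feature_06h):
    """Bit 0 of VWC set and WCE (bit 0 of Feature 06h) set."""
    return bool(vwc & 0x1) and bool(feature_06h & 0x1)


def layoutcommit(namespaces, flush, commit_metadata):
    """Flush volatile caches, then finish LAYOUTCOMMIT.

    namespaces: iterable of dicts with hypothetical keys
                "nsid", "vwc", and "feature_06h".
    flush: callback issuing an NVMe FLUSH for one namespace.
    commit_metadata: callback completing LAYOUTCOMMIT on the MDS.
    """
    for ns in namespaces:
        if volatile_write_cache_enabled(ns["vwc"],
                                        ns["feature_06h"]):
            flush(ns["nsid"])
    # Reached only after every required flush has completed.
    return commit_metadata()
```
]]></artwork>
    </figure>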
  </section>
</section>

<section anchor="sec:security" title="Security Considerations">
  <t>
    NFSv4 clients access NFSv4 metadata servers using the NFSv4
    protocol. The security considerations generally described in
    <xref target="RFC8881" /> apply to a client's interactions with
    the metadata server. However, NFSv4 clients and servers access
    NVMe storage devices at a lower layer than NFSv4. NFSv4 and
    RPC security is not directly applicable to I/O to data servers
    using NVMe.
  </t>
  <t>
    pNFS with an NVMe layout can be used with NVMe transports
    (e.g., NVMe over PCIe <xref target="NVME-PCIE" />) that provide
    essentially no additional security functionality. Alternatively,
    pNFS may be used with storage protocols such as NVMe over TCP
    <xref target="NVME-TCP" /> that can provide significant transport
    layer security.
  </t>
  <t>
    It is the responsibility of those administering and deploying
    pNFS with an NVMe layout to ensure that appropriate protection is
    deployed for the storage protocol in use.
    When using IP-based storage protocols such as NVMe over TCP, data
    confidentiality and integrity SHOULD be provided for traffic between
    pNFS clients and NVMe storage devices by using a secure communication
    protocol such as TLS  <xref target="RFC8446" />.  For NVMe over TCP,
    TLS SHOULD be used as described in <xref target="NVME-TCP" /> to
    protect traffic between pNFS clients and NVMe namespaces used as
    storage devices.
  </t>
  <t>
    Physical security is a common means of protection for protocols not
    based on IP.
    In environments where the security requirements for the storage
    protocol cannot be met, pNFS with an NVMe layout SHOULD NOT be
    deployed.
  </t>
  <t>
    When security is available for the data server storage protocol,
    it is generally at a different granularity and with a different
    notion of identity than NFSv4 (e.g., NFSv4 controls user access
    to files, and NVMe controls initiator access to volumes).  As
    with pNFS with the block layout type <xref target="RFC5663" />,
    the pNFS client is responsible for enforcing appropriate
    correspondences between these security layers. In environments
    where the security requirements are such that client-side
    protection from access to storage outside of the layout is not
    sufficient, pNFS with a SCSI layout on an NVMe namespace SHOULD
    NOT be deployed.
  </t>
  <t>
    As with other block-oriented pNFS layout types, the metadata server
    is able to fence off a client's access to the data on an NVMe namespace
    used as a storage device.  If a metadata server revokes a layout, the
    client's access MUST be terminated at the storage devices via fencing
    as specified in <xref target="ssc:fencing" />.  The client has a
    subsequent opportunity to acquire a new layout.
  </t>
</section>

<section anchor="sec:iana" title="IANA Considerations">
  <t>
   This document does not require any actions by IANA.
  </t>
</section>

</middle>

<back>

<references title="Normative References">
  <reference anchor='RFC2119'>
    <front>
      <title abbrev='RFC Key Words'>Key words for use in RFCs to Indicate Requirement Levels</title>
      <author initials='S.' surname='Bradner' fullname='Scott Bradner'>
        <organization>Harvard University</organization>
	<address>
	  <postal>
	    <street>1350 Mass. Ave.</street>
	    <street>Cambridge</street>
	    <street>MA 02138</street>
	  </postal>
	  <phone>+1 617 495 3864</phone>
	  <email>sob@harvard.edu</email>
	</address>
      </author>
      <date year='1997' month='March' />
    </front>
    <seriesInfo name='BCP' value='14'/>
    <seriesInfo name='RFC' value='2119'/>
    <seriesInfo name='DOI' value='10.17487/RFC2119'/>
  </reference>

  <reference  anchor='RFC5663' target='https://www.rfc-editor.org/info/rfc5663'>
    <front>
      <title>Parallel NFS (pNFS) Block/Volume Layout</title>
      <author initials='D.' surname='Black' fullname='D. Black'><organization /></author>
      <author initials='S.' surname='Fridella' fullname='S. Fridella'><organization /></author>
      <author initials='J.' surname='Glasgow' fullname='J. Glasgow'><organization /></author>
      <date year='2010' month='January' />
    </front>
    <seriesInfo name='RFC' value='5663'/>
    <seriesInfo name='DOI' value='10.17487/RFC5663'/>
  </reference>

  <reference anchor='RFC8154'>
    <front>
       <title>Parallel NFS (pNFS) Small Computer System Interface (SCSI) Layout</title>
       <author initials='C.' surname='Hellwig' fullname='Christoph Hellwig'/>
       <date month='May' year='2017'/>
    </front>
    <seriesInfo name='RFC' value='8154'/>
    <seriesInfo name='DOI' value='10.17487/RFC8154'/>
  </reference>

  <reference  anchor='RFC8174' target='https://www.rfc-editor.org/info/rfc8174'>
    <front>
      <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
      <author initials='B.' surname='Leiba' fullname='B. Leiba'><organization /></author>
      <date year='2017' month='May' />
      <abstract><t>RFC 2119 specifies common key words that may be used in protocol  specifications.  This document aims to reduce the ambiguity by clarifying that only UPPERCASE usage of the key words have the  defined special meanings.</t></abstract>
    </front>
    <seriesInfo name='BCP' value='14'/>
    <seriesInfo name='RFC' value='8174'/>
    <seriesInfo name='DOI' value='10.17487/RFC8174'/>
  </reference>

  <reference  anchor='RFC8881' target='https://www.rfc-editor.org/info/rfc8881'>
    <front>
      <title>Network File System (NFS) Version 4 Minor Version 1 Protocol</title>
      <author initials='D.' surname='Noveck' fullname='D. Noveck' role='editor'><organization /></author>
      <author initials='C.' surname='Lever' fullname='C. Lever'><organization /></author>
      <date year='2020' month='August' />
    </front>
    <seriesInfo name='RFC' value='8881'/>
    <seriesInfo name='DOI' value='10.17487/RFC8881'/>
  </reference>

  <reference anchor='SPC5'>
    <front>
      <title>SCSI Primary Commands-5</title>
      <author>
         <organization>INCITS Technical Committee T10</organization>
      </author>
      <date year="2019"/>
    </front>
    <seriesInfo name="ANSI INCITS" value="502-2019"/>
  </reference>

  <reference anchor='NVME-BASE'>
    <front>
      <title>NVM Express Base Specification, Revision 2.0</title>
      <author>
         <organization>NVM Express, Inc.</organization>
      </author>
      <date month="May" year="2021"/>
    </front>
  </reference>

  <reference anchor='NVME-NVM'>
    <front>
      <title>NVM Express NVM Command Set Specification, Revision 1.0</title>
      <author>
         <organization>NVM Express, Inc.</organization>
      </author>
      <date month="May" year="2021"/>
    </front>
  </reference>

  <reference anchor='NVME-PCIE'>
    <front>
      <title>NVMe over PCIe Transport Specification, Revision 1.0</title>
      <author>
         <organization>NVM Express, Inc.</organization>
      </author>
      <date month="May" year="2021"/>
    </front>
  </reference>

  <reference anchor='NVME-TCP'>
    <front>
      <title>NVM Express TCP Transport Specification, Revision 1.0</title>
      <author>
         <organization>NVM Express, Inc.</organization>
      </author>
      <date month="May" year="2021"/>
    </front>
  </reference>
</references>

<references title="Informative References">
  <reference  anchor='RFC8446' target='https://www.rfc-editor.org/info/rfc8446'>
    <front>
      <title>The Transport Layer Security (TLS) Protocol Version 1.3</title>
      <author initials='E.' surname='Rescorla' fullname='E. Rescorla'><organization /></author>
      <date year='2018' month='August' />
    </front>
    <seriesInfo name='RFC' value='8446'/>
    <seriesInfo name='DOI' value='10.17487/RFC8446'/>
  </reference>
</references>

</back>

</rfc>
