<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc SYSTEM "rfc2629-xhtml.ent">
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt"?>

<rfc
 category="std"
 docName="draft-ietf-nfsv4-layrec-01"
 ipr="trust200902"
 obsoletes=""
 scripts="Common,Latin"
 sortRefs="true"
 submissionType="IETF"
 symRefs="true"
 tocDepth="3"
 tocInclude="true"
 version="3"
 xml:lang="en">

<front>
  <title abbrev="LAYOUT_RECOVERY">
    Reporting of Errors via LAYOUTRETURN in NFSv4.2
  </title>
  <seriesInfo name="Internet-Draft" value="draft-ietf-nfsv4-layrec-01"/>
  <author fullname="Thomas Haynes" initials="T." surname="Haynes">
    <organization abbrev="Hammerspace">Hammerspace</organization>
    <address>
      <email>loghyr@hammerspace.com</email>
    </address>
  </author>
  <author fullname="Trond Myklebust" initials="T." surname="Myklebust">
    <organization abbrev="Hammerspace">Hammerspace</organization>
    <address>
      <email>trondmy@hammerspace.com</email>
    </address>
  </author>
  <date year="2024" month="March" day="21"/>
  <area>Transport</area>
  <workgroup>Network File System Version 4</workgroup>
  <keyword>NFSv4</keyword>
  <abstract>
    <t>
      The Parallel Network File System (pNFS) allows
      for a file's metadata (MDS) and data (DS) to be on different
      servers. When the metadata server is restarted, the client
      can still modify the data file component.  During the
      recovery phase of startup, the metadata server and the
      data servers work together to recover state (which files
      are open, last modification time, size, etc). If the client
      has not encountered errors with the data files, then the state can be
      recovered, avoiding resilvering of the data files. With any
      errors, there is no means by which the client can report errors to the
      metadata server. As such, the metadata server has to
      assume that file needs resilvering. This document presents an
      extension to RFC8435 to allow the client to update the metadata
      and avoid the resilvering.
    </t>
  </abstract>

  <note removeInRFC="true">
    <t>
      Discussion of this draft takes place
      on the NFSv4 working group mailing list (nfsv4@ietf.org),
      which is archived at
      <eref target="https://mailarchive.ietf.org/arch/browse/nfsv4/"/>.
      Working Group information can be found at
      <eref target="https://datatracker.ietf.org/wg/nfsv4/about/"/>.
    </t>
  </note>
</front>

<middle>

<section anchor="sec_intro" numbered="true" removeInRFC="false" toc="default">
  <name>Introduction</name>
  <t>
    In the Network File System version4 (NFSv4) with a Parallel NFS
    (pNFS) Flexible File Layout (<xref target="RFC8435" format="default"
    sectionFormat="of"/>) server, during recovery after a restart,
    there is no mechanism for the client
    to inform the metadata server about an error which occurred during a
    WRITE (see Section 18.32 of <xref target="RFC8881" format="default"
    sectionFormat="of"/>) operation to the data servers in the period of
    the outage.
  </t>

  <t>
    Using the process detailed in <xref target="RFC8178" format="default"
    sectionFormat="of"/>, the revisions in this document become an
    extension of NFSv4.2 <xref target="RFC7862" format="default"
    sectionFormat="of"/>. They are built on top of the external data
    representation (XDR) <xref target="RFC4506" format="default"
    sectionFormat="of"/> generated from <xref target="RFC7863"
    format="default" sectionFormat="of"/>.
  </t>

  <section anchor="sec_defs" numbered="true" removeInRFC="false" toc="default">
    <name>Definitions</name>
    <t>
      See Section 1.1 of <xref target="RFC8435" format="default"
      sectionFormat="of"/> for a more complete set of definitions.
    </t>

    <dl newline="false" spacing="normal">
      <dt>(file) data:</dt>
      <dd>
        that part of the file system object that contains the
        data to be read or written.  It is the contents of the object
        rather than the attributes of the object.
      </dd>

      <dt>data server (DS):</dt>
      <dd>
        a pNFS server that provides the file's data when
        the file system object is accessed over a file-based protocol.
      </dd>

      <dt>(file) metadata:</dt>
      <dd>
        the part of the file system object that contains
        various descriptive data relevant to the file object, as opposed
        to the file data itself.  This could include the time of last
        modification, access time, EOF position, etc.
      </dd>

      <dt>metadata server (MDS):</dt>
      <dd>
        the pNFS server that provides metadata information for a file system object.
      </dd>
    </dl>
  </section>
  <section numbered="true" removeInRFC="false" toc="default">
    <name>Requirements Language</name>
    <t>
      The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>",
      "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL
      NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>",
      "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
      "<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this
      document are to be interpreted as described in BCP&nbsp;14 <xref
      target="RFC2119" format="default" sectionFormat="of"/> <xref
      target="RFC8174" format="default" sectionFormat="of"/> when,
      and only when, they appear in all capitals, as shown here.
    </t>
  </section>
</section>

<section anchor="layout_state_recovery" numbered="true" removeInRFC="false" toc="default">
  <name>Layout State Recovery</name>
  <t>
    When a metadata server restarts, clients are provided a grace period where
    they are allowed to recover any state that
    they had established. With open files, the client can send an OPEN (see
    Section 18.16 of <xref target="RFC8881" format="default" sectionFormat="of"/>)
    operation with a claim type of CLAIM_PREVIOUS (see Section 9.11 of
    <xref target="RFC8881" format="default" sectionFormat="of"/>). The client
    uses the RECLAIM_COMPLETE (see Section 18.51
    of <xref target="RFC8881" format="default" sectionFormat="of"/>) operation
    to notify the metadata server that it is done reclaiming state.
  </t>
  <t>
    The NFSv4 Flexible File Layout Type allows for the client to mirror files
    (see Section 8 of <xref target="RFC8435" format="default" sectionFormat="of"/>).
    With client side mirroring, it is important for the client to inform
    the metadata server of any I/O errors encountered with one of the mirrors.
    This is the only way for the metadata server to determine one or more
    of the mirrors is corrupt and then repair the mirrors via resilvering.
    The client can use LAYOUTRETURN (see
    Section 18.44 of <xref target="RFC8881" format="default" sectionFormat="of"/>)
    and the ff_ioerr4 (see Section 9.1.1 of <xref target="RFC8435" format="default" sectionFormat="of"/>) structure to inform
    the metadata server of I/O errors.
  </t>
  <t>
    A problem is that if the metadata server restarts and the client has
    errors it needs to report, it can not do so. The LAYOUTRETURN needs
    a layout stateid to proceed and there is no way for the client to
    recover layout state. As such, clients have no choice but to not recover
    files with I/O errors. In turn, the metadata server <bcp14>MUST</bcp14>
    assume that the mirrors are inconsistent and pick one for resilvering.
    It is a <bcp14>MUST</bcp14> because even if the metadata server can
    determine that the client did modify data during the outage, it <bcp14>MUST NOT</bcp14>
    assume those modifications were consistent.
  </t>
  <t>
    If the server were to allow the client to use the anonymous stateid
    of all zeros
    (see Section 8.2.3 of <xref target="RFC8881" format="default" sectionFormat="of"/>) for
    lrf_stateid in LAYOUTRETURN (see Section 18.44.1 of
    <xref target="RFC8881" format="default" sectionFormat="of"/>),
    then the client could inform the metadata server of errors
    encountered. That in turn would allow the metadata server to
    accurately resilver the file by picking the correct mirror(s).
  </t>
  <t>
    There are two error scenarios that can occur:
  </t>
  <dl newline="false" spacing="normal">
    <dt>During the grace period:</dt>
    <dd>
      If the client were to send any lrf_stateid
      in the LAYOUTRETURN other than the anonymous stateid 
      of all zeros, then the metadata server would respond with an error of
      NFS4ERR_GRACE (see Section of 15.1.9.2 <xref target="RFC8881" format="default" sectionFormat="of"/>).
    </dd>
    <dt>After the grace period:</dt>
    <dd>
      If the client were to send any lrf_stateid
      in the LAYOUTRETURN with the anonymous stateid 
      of all zeros, then the metadata server would respond with an error of
      NFS4ERR_NO_GRACE (see Section 15.1.9.3 of <xref target="RFC8881" format="default" sectionFormat="of"/>).
    </dd>
  </dl>
  <t>
    Also, when the metadata server builds the reply to the LAYOUTRETURN,
    it <bcp14>MUST NOT</bcp14> bump the seqid of the lorr_stateid.
  </t>
  <t>
    The metadata server <bcp14>MUST NOT</bcp14> resilver a file if there
    are clients with outstanding layouts with iomode of
    LAYOUTIOMODE4_RW. Whether the metadata server prevents all I/O to
    the file until the resilvering is done or forces all I/O to go through
    the metadata server or allows a proxy server to update the new data
    file as it is being reslivered is all an implementation choice. The
    constraint is that the metadata server is responsible for the
    reconstruction of the data file and for the consistency of the
    mirrors.
    The metadata server <bcp14>MUST NOT</bcp14> have been resilvering the file
    such that it has a different layout (set of mirror instances) than
    the client before the restart of the metadata server.  Further, the
    metadata server <bcp14>MUST NOT</bcp14> start a new resilvering of
    the file during the grace period unless there are no outstanding layouts with iomode of
    LAYOUTIOMODE4_RW.
  </t>
  <t>
    If the metadata server detects that the layout being returned in
    the LAYOUTRETURN does not match the current mirror instances found
    for the file, then it should ignore the LAYOUTRETURN and resilver the
    file in question.
  </t>
  <t>
    Finally, the metadata server <bcp14>MUST</bcp14> determine that any files
    which are neither explicitly recovered with a CLAIM_PREVIOUS nor
    have a reported error via a LAYOUTRETURN, have to be
    resilvered. The client has most likely restarted and lost any state.
  </t>
</section>

<section anchor="sec_security" numbered="true" removeInRFC="false" toc="default">
  <name>Security Considerations</name>
  <t>
    There are no new security considerations beyond those in
    <xref target="RFC7862" format="default" sectionFormat="of"/>.
  </t>
</section>

<section anchor="sec_iana" numbered="true" removeInRFC="false" toc="default">
  <name>IANA Considerations</name>
  <t>
    IANA should use the current document (RFC-TBD) as the reference for the new entries.
  </t>
</section>

</middle>

<back>

<references>
  <name>References</name>

  <references>
  <name>Normative References</name>
    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
       href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
       href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.4506.xml"/>
    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
       href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7862.xml"/>
    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
       href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7863.xml"/>
    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
       href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
       href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8178.xml"/>
    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
       href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8435.xml"/>
    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
       href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8881.xml"/>

  </references>
</references>

<section numbered="true" removeInRFC="false" toc="default">
      <name>Acknowledgments</name>
      <t>
        Tigran Mkrtchyan, Jeff Layton, and Rick Macklem provided reviews of the document.
      </t>
    </section>

</back>

</rfc>
