<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt'?>

<rfc
 category='std'
 docName='draft-haynes-nfsv4-erasure-encoding-03'
 ipr='trust200902'
 obsoletes=''
 scripts='Common,Latin'
 sortRefs='true'
 submissionType='IETF'
 symRefs='true'
 tocDepth='3'
 tocInclude='true'
 version='3'
 consensus='true'
 xml:lang='en'>

<front>
  <title abbrev='erasure encoding'>
    Erasure Encoding of Files in NFSv4.2
  </title>
  <seriesInfo name='Internet-Draft' value='draft-haynes-nfsv4-erasure-encoding-03'/>
  <author fullname='Thomas Haynes' initials='T.' surname='Haynes'>
    <organization abbrev='Hammerspace'>Hammerspace</organization>
    <address>
      <email>loghyr@gmail.com</email>
    </address>
  </author>
  <date year='2024' month='November' day='05'/>
  <area>Transport</area>
  <workgroup>Network File System Version 4</workgroup>
  <keyword>NFSv4</keyword>
  <abstract>
    <t>
      Parallel NFS (pNFS) allows a separation between the metadata (onto
      a metadata server) and data (onto a storage device) for a file.
      The Flexible File Version 2 Layout Type is defined in this document
      as an extension to pNFS that allows the use of storage devices that
      require only a limited degree of interaction with the metadata
      server and use already-existing protocols.  Data replication is
      also added to provide integrity.
    </t>
  </abstract>

  <note removeInRFC='true'>
    <t>
      Discussion of this draft takes place
      on the NFSv4 working group mailing list (nfsv4@ietf.org),
      which is archived at
      <eref target='https://mailarchive.ietf.org/arch/browse/nfsv4/'/>.
      Working Group information can be found at
      <eref target='https://datatracker.ietf.org/wg/nfsv4/about/'/>.
    </t>
  </note>

  <note removeInRFC='true'>
    <t>
      This draft starts sparse and will be filled in as details are
      ironed out. For example, WRITE_BLOCK4 in <xref target='WRITE_BLOCK4' />
      is presented as being WRITE4 (see Section 18.32 of <xref
      target='RFC8881' format='default' sectionFormat='of' />) plus
      some semantic changes. In the first draft, we simply explain the
      semantic changes. As these are accepted by knowledgeable reviewers,
      we will flesh out the WRITE_BLOCK4 section to include sub-sections more
      akin to 18.32.3 and 18.32.4 of <xref
      target='RFC8881' format='default' sectionFormat='of' />.
    </t>
    <t>
      Except where called out, all the semantics of the Flexible File Version 1 Layout
      Type presented in <xref target='RFC8435'
      format='default' sectionFormat='of' /> still apply. This new
      version extends it and does not replace it.
    </t>
  </note>
</front>

<middle>

  <section anchor='sec_intro' numbered='true' removeInRFC='false' toc='default'>
    <name>Introduction</name>
    <t>
      In Parallel NFS (pNFS) (see Section 12 of
      <xref target='RFC8881' format='default' sectionFormat='of' />),
      the metadata server returns layout type
      structures that describe where file data is located.  There are
      different layout types for different storage systems and methods
      of arranging data on storage devices.  <xref target='RFC8435'
      format='default' sectionFormat='of' /> defined the Flexible
      File Version 1 Layout Type used with file-based data servers that are
      accessed using the NFS protocols: NFSv3 <xref target='RFC1813'
      format='default' sectionFormat='of' />, NFSv4.0 <xref
      target='RFC7530' format='default' sectionFormat='of' />, NFSv4.1
      <xref target='RFC8881' format='default' sectionFormat='of' />, and
      NFSv4.2 <xref target='RFC7862' format='default' sectionFormat='of' />.
    </t>

    <t>
      The Client Side Mirroring (see Section 8 of <xref target='RFC8435'
      format='default' sectionFormat='of'/>), introduced with the first
      version of the Flexible File Layout Type, provides for replication
      of data but does not provide for integrity of data. In the event
      of an error, a user would be able to repair the file by resilvering
      the mirror contents, i.e., they would pick one of the mirror
      instances and replicate it to the other instance locations.
    </t>

    <t>
      However, without integrity checks, silent corruption cannot be
      detected, and choosing which copy is the good one
      is difficult.  This document updates the Flexible File Layout Type
      to version 2 by providing data integrity for erasure encoding.
      It introduces new variants of COMMIT4 (see Section 18.3 of <xref
      target='RFC8881' format='default' sectionFormat='of' />), READ4
      (see Section 18.22 of <xref target='RFC8881' format='default'
      sectionFormat='of' />), and WRITE4 (see Section 18.32 of <xref
      target='RFC8881' format='default' sectionFormat='of' />) to allow
      for the transmission of integrity-checking information.
    </t>

    <t>
      Using the process detailed in <xref target='RFC8178' format='default'
      sectionFormat='of'/>, the revisions in this document become an
      extension of NFSv4.2 <xref target='RFC7862' format='default'
      sectionFormat='of'/>. They are built on top of the external data
      representation (XDR) <xref target='RFC4506' format='default'
      sectionFormat='of'/> generated from <xref target='RFC7863'
      format='default' sectionFormat='of'/>.
    </t>

    <section anchor='sec_defs' numbered='true' removeInRFC='false' toc='default'>
      <name>Definitions</name>
      <dl newline='false' spacing='normal'>
        <dt>block:</dt>
        <dd>
          One of the resulting blocks to be exchanged with a data
          server after a transformation has been applied to
          a data block. Note that the resulting block
          may be a different size than the data block.
        </dd>

        <dt>Client Side Mirroring:</dt>
        <dd>
          A file-based replication method where copies are maintained
          in parallel.
        </dd>

        <dt>data block:</dt>
        <dd>
          A block of data in the client's cache for a file.
        </dd>

        <dt>Erasure Encoding:</dt>
        <dd>
          A data protection scheme where a block of data is split
          into fragments and additional redundant fragments are added
          to achieve parity. The new blocks are stored in different
          locations.
        </dd>

        <dt>Client Side Erasure Encoding:</dt>
        <dd>
          A file-based integrity method where copies are maintained
          in parallel.
        </dd>

        <dt>consistency of payload:</dt>
        <dd>
          A payload is consistent when all contained blocks have
          the same owner, i.e., they share the same writing
          client and transaction id.
        </dd>

        <dt>integrity of data:</dt>
        <dd>
          Data integrity refers to the accuracy, consistency, and
          reliability of data throughout its life cycle.
        </dd>

        <dt>payload:</dt>
        <dd>
          The set of metadata header and transformed blocks
          generated per data block by the erasure encoding
          type.  Note that the resulting blocks might
          be of type active, parity, spare, or repair.
        </dd>

        <dt>replication of data:</dt>
        <dd>
          Data replication is making and storing multiple copies
          of data in different locations.
        </dd>

        <dt>write hole:</dt>
        <dd>
          A write hole is a data corruption scenario where either
          two clients are trying to write to the same block or
          one client is overwriting an existing block of data.
        </dd>
      </dl>
    </section>

    <section numbered='true' removeInRFC='false' toc='default'>
      <name>Requirements Language</name>
      <t>
        The key words '<bcp14>MUST</bcp14>', '<bcp14>MUST NOT</bcp14>',
        '<bcp14>REQUIRED</bcp14>', '<bcp14>SHALL</bcp14>', '<bcp14>SHALL
        NOT</bcp14>', '<bcp14>SHOULD</bcp14>', '<bcp14>SHOULD NOT</bcp14>',
        '<bcp14>RECOMMENDED</bcp14>', '<bcp14>NOT RECOMMENDED</bcp14>',
        '<bcp14>MAY</bcp14>', and '<bcp14>OPTIONAL</bcp14>' in this
        document are to be interpreted as described in BCP 14 <xref
        target='RFC2119' format='default' sectionFormat='of'/> <xref
        target='RFC8174' format='default' sectionFormat='of'/> when,
        and only when, they appear in all capitals, as shown here.
      </t>
    </section>
  </section>

  <section numbered='true' removeInRFC='false' toc='default'>
    <name>Flexible File Version 2 Layout Type</name>
    <t>
      In order to introduce erasure encoding to pNFS, a new layout type
      of LAYOUT4_FLEX_FILES_V2 needs to be defined.  While we could
      define a new layout type per erasure encoding type, there exist
      use cases where multiple erasure encoding types exist in the same layout.
    </t>
    <t>
      The original layouttype4 introduced in <xref target='RFC8881'
      format='default' sectionFormat='of' /> is modified as shown in <xref
      target='code_layout4' />.
    </t>

    <figure anchor='code_layout4'>
      <sourcecode type='xdr'>
       enum layouttype4 {
           LAYOUT4_NFSV4_1_FILES   = 1,
           LAYOUT4_OSD2_OBJECTS    = 2,
           LAYOUT4_BLOCK_VOLUME    = 3,
           LAYOUT4_FLEX_FILES      = 4,
           LAYOUT4_FLEX_FILES_V2   = 5
       };

       struct layout_content4 {
           layouttype4             loc_type;
           opaque                  loc_body&lt;&gt;;
       };

       struct layout4 {
           offset4                 lo_offset;
           length4                 lo_length;
           layoutiomode4           lo_iomode;
           layout_content4         lo_content;
       };
      </sourcecode>
    </figure>

    <t>
      This document defines structures associated with the layouttype4
      value LAYOUT4_FLEX_FILES_V2.  <xref target='RFC8881' format='default'
      sectionFormat='of' /> specifies the loc_body structure as an XDR
      type 'opaque'.  The opaque layout is uninterpreted by the generic
      pNFS client layers but is interpreted by the Flexible File Version 2 Layout
      Type implementation.  This section defines the structure of this
      otherwise opaque value, ffv2_layout4.
    </t>

    <section anchor='ffv2_encoding_type' numbered='true' removeInRFC='false' toc='default'>
      <name>ffv2_encoding_type</name>

      <figure anchor='code_ffv2_encoding_type'>
        <sourcecode type='xdr'>
   /// enum ffv2_encoding_type {
   ///     FFV2_ENCODING_MIRRORED       = 0x1
   /// };
        </sourcecode>
      </figure>

      <t>
        The ffv2_encoding_type (see <xref target='code_ffv2_encoding_type' />)
        encompasses a new IANA registry for 'Flex Files V2 Erasure
        Encoding Type Registry' (see <xref target='sec_iana_encoding' />).
        I.e., instead of defining a new Layout Type for each
        Erasure Encoding, we define a new Erasure Encoding Type.
        Except for FFV2_ENCODING_MIRRORED, each of the types
        is expected to employ the new operations in this document.
      </t>

      <t>
        FFV2_ENCODING_MIRRORED offers replication of data and
        not integrity of data. As such, it does not need operations
        like WRITE_BLOCK4 (see <xref target='WRITE_BLOCK4' />).
      </t>
    </section>

    <section anchor='ff_flags4' numbered='true' removeInRFC='false' toc='default'>
      <name>ff_flags4</name>
      <figure anchor='code_ff_flags4'>
        <sourcecode type='xdr'>
   const FF_FLAGS_NO_LAYOUTCOMMIT   = 0x00000001;
   const FF_FLAGS_NO_IO_THRU_MDS    = 0x00000002;
   const FF_FLAGS_NO_READ_IO        = 0x00000004;
   const FF_FLAGS_WRITE_ONE_MIRROR  = 0x00000008;
   typedef uint32_t            ff_flags4;
        </sourcecode>
      </figure>
      <t>
        ff_flags4 is defined as in Section 5.1 of <xref target='RFC8435'
        format='default' sectionFormat='of'/> and is shown
        in <xref target='code_ff_flags4' /> for reference.
      </t>
    </section>

    <section anchor='ffv2_file_info4' numbered='true' removeInRFC='false' toc='default'>
      <name>ffv2_file_info4</name>
      <figure anchor='code_ffv2_file_info4'>
        <sourcecode type='xdr'>
   /// struct ffv2_file_info4 {
   ///     stateid4                fffi_stateid;
   ///     nfs_fh4                 fffi_fh_vers;
   /// };
        </sourcecode>
      </figure>
      <t>
        The ffv2_file_info4 is a new structure to help with
        the stateid issue discussed in Section 5.1
        of <xref target='RFC8435'
        format='default' sectionFormat='of'/>. In
        version 1 of the Flexible File Layout Type, there
        was the singleton ffds_stateid combined with the
        ffds_fh_vers array, even though each NFSv4 version has
        its own stateid. In <xref target='code_ffv2_file_info4' />,
        each NFSv4 file handle has a one-to-one
        correspondence to a stateid.
      </t>
    </section>

    <section anchor='ffv2_ds_flags4' numbered='true' removeInRFC='false' toc='default'>
      <name>ffv2_ds_flags4</name>
      <figure anchor='code_ffv2_ds_flags4'>
        <sourcecode type='xdr'>
   /// const FFV2_DS_FLAGS_ACTIVE        = 0x00000001;
   /// const FFV2_DS_FLAGS_SPARE         = 0x00000002;
   /// const FFV2_DS_FLAGS_PARITY        = 0x00000004;
   /// const FFV2_DS_FLAGS_REPAIR        = 0x00000008;
   /// typedef uint32_t            ffv2_ds_flags4;
        </sourcecode>
      </figure>
      <t>
        The ffv2_ds_flags4 (in <xref target='code_ffv2_ds_flags4' />) flags detail the state of the data servers.
        With Erasure Encoding algorithms, there are both Systematic and Non-Systematic
        approaches. In the Systematic approach, the bits for integrity are placed amongst the
        resulting transformed block. Such an implementation would typically see
        FFV2_DS_FLAGS_ACTIVE and FFV2_DS_FLAGS_SPARE data servers. The FFV2_DS_FLAGS_SPARE
        ones allow the client to repair a payload without engaging the metadata server.
        I.e., if one of the FFV2_DS_FLAGS_ACTIVE data servers did not respond to a WRITE_BLOCK4,
        the client could fail the block over to the FFV2_DS_FLAGS_SPARE data server.
      </t>
      <t>
        With the Non-Systematic approach, the data and integrity live on different
        data servers. Such an implementation would typically see FFV2_DS_FLAGS_ACTIVE
        and FFV2_DS_FLAGS_PARITY data servers. If the implementation wanted to allow
        for local repair, it would also use FFV2_DS_FLAGS_SPARE. Note that with
        a Non-Systematic approach, it is possible to update parts of the blocks,
        see <xref target='update_header' />.
      </t>
      <t>
        See  <xref target='Plank97' format='default' sectionFormat='of'/> for further
        reference to storage layouts for encoding.
      </t>
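      <t>
        As a non-normative illustration of how a client might use these
        flags, the following Python sketch picks a fail-over target when
        an FFV2_DS_FLAGS_ACTIVE data server does not respond. The flag
        values are taken from <xref target='code_ffv2_ds_flags4' />;
        the DataServer class, its responsive field, and the functions
        pick_spare and write_target are hypothetical names introduced
        only for this example and are not part of the protocol.
      </t>
      <figure>
        <sourcecode type='python'>
```python
from dataclasses import dataclass
from enum import IntFlag
from typing import List, Optional


class DSFlags(IntFlag):
    # Values copied from the ffv2_ds_flags4 constants.
    ACTIVE = 0x00000001
    SPARE = 0x00000002
    PARITY = 0x00000004
    REPAIR = 0x00000008


@dataclass
class DataServer:
    # Hypothetical stand-in for ffv2_data_server4; only the
    # fields needed by this sketch are modeled.
    name: str
    flags: DSFlags
    responsive: bool = True


def pick_spare(servers: List[DataServer]) -> Optional[DataServer]:
    """Return the first responsive spare, or None if there is none."""
    for ds in servers:
        if DSFlags.SPARE in ds.flags and ds.responsive:
            return ds
    return None


def write_target(servers: List[DataServer],
                 index: int) -> Optional[DataServer]:
    """Choose the data server that should receive the block
    originally destined for servers[index]."""
    primary = servers[index]
    if primary.responsive:
        return primary
    # Fail the block over to a spare without involving the
    # metadata server.
    return pick_spare(servers)
```
        </sourcecode>
      </figure>
      <t>
        In this sketch, a result of None from write_target means that no
        spare is available, at which point a client would presumably
        have to report the error to the metadata server.
      </t>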
    </section>

    <section anchor='ffv2_data_server4' numbered='true' removeInRFC='false' toc='default'>
      <name>ffv2_data_server4</name>
      <figure anchor='code_ffv2_data_server4'>
        <sourcecode type='xdr'>
   /// struct ffv2_data_server4 {
   ///     deviceid4               ffds_deviceid;
   ///     uint32_t                ffds_efficiency;
   ///     ffv2_file_info4         ffds_file_info&lt;&gt;;
   ///     fattr4_owner            ffds_user;
   ///     fattr4_owner_group      ffds_group;
   ///     ffv2_ds_flags4          ffds_flags;
   /// };
        </sourcecode>
      </figure>
      <t>
        The ffv2_data_server4 (in <xref target='code_ffv2_data_server4' />) describes
        a data file and how to access it via the different NFS protocols.
      </t>
    </section>

    <section anchor='ffv2_encoding_type_data' numbered='true' removeInRFC='false' toc='default'>
      <name>ffv2_encoding_type_data</name>
      <figure anchor='code_ffv2_encoding_type_data'>
        <sourcecode type='xdr'>
   /// union ffv2_encoding_type_data switch
   ///         (ffv2_encoding_type fetd_encoding) {
   ///     case FFV2_ENCODING_MIRRORED:
   ///         void;
   /// };
        </sourcecode>
      </figure>
      <t>
        The ffv2_encoding_type_data (in <xref target='code_ffv2_encoding_type_data' />) describes
        erasure encoding type specific fields. I.e., this is how the encoding type can
        communicate the need for counts of active, spare, parity, and repair types
        of blocks.
      </t>
    </section>

    <section anchor='ffv2_mirror4' numbered='true' removeInRFC='false' toc='default'>
      <name>ffv2_mirror4</name>
      <figure anchor='code_ffv2_mirror4'>
        <sourcecode type='xdr'>
   /// struct ffv2_mirror4 {
   ///     ffv2_data_server4       ffm_data_servers&lt;&gt;;
   ///     ffv2_encoding_type_data ffm_encoding_type_data;
   /// };
        </sourcecode>
      </figure>
      <t>
        The ffv2_mirror4 (in <xref target='code_ffv2_mirror4' />) describes
        the Flexible File Layout Version 2 specific fields.
      </t>
    </section>

    <section anchor='ffv2_layout4' numbered='true' removeInRFC='false' toc='default'>
      <name>ffv2_layout4</name>
      <figure anchor='code_ffv2_layout4'>
        <sourcecode type='xdr'>
   /// struct ffv2_layout4 {
   ///     length4                 ffl_stripe_unit;
   ///     ffv2_mirror4            ffl_mirrors&lt;&gt;;
   ///     ff_flags4               ffl_flags;
   ///     uint32_t                ffl_stats_collect_hint;
   /// };
        </sourcecode>
      </figure>
      <t>
        The ffv2_layout4 (in <xref target='code_ffv2_layout4' />) describes
        the Flexible Files Layout Version 2.
      </t>
    </section>

    <section anchor='ffv2_layouthint4' numbered='true' removeInRFC='false' toc='default'>
      <name>ffv2_layouthint4</name>
      <figure anchor='code_ffv2_layouthint4'>
        <sourcecode type='xdr'>
   /// union ffv2_mirrors_hint switch
   ///         (ffv2_encoding_type ffmh_type) {
   ///     case FFV2_ENCODING_MIRRORED:
   ///         void;
   /// };
   ///
   /// struct ffv2_layouthint4 {
   ///     ffv2_encoding_type fflh_supported_types&lt;&gt;;
   ///     ffv2_mirrors_hint  fflh_mirrors_hint;
   /// };
        </sourcecode>
      </figure>
      <t>
        The ffv2_layouthint4 (in <xref target='code_ffv2_layouthint4' />) describes
        the layout_hint (see Section 5.12.4 of <xref target='RFC8881' format='default' sectionFormat='of' />)
        that the client can provide to the metadata server.
      </t>
    </section>


    <section anchor='sec_mix_types' numbered='true' removeInRFC='false' toc='default'>
      <name>Mixing of Encoding Types</name>
      <t>
        Note that, effectively, multiple encoding types can be present
        in a Flexible Files Version 2 Layout Type layout.  The ffv2_layout4 has an array
        of ffv2_mirror4, each of which has an ffv2_encoding_type.
        The main reason to allow for this is to provide for either the
        assimilation of a non-erasure encoded file to an erasure
        encoded file or the exporting of an erasure encoded file to
        a non-erasure encoded file.
      </t>
      <t>
        Assume there is an additional ffv2_encoding_type of
        FFV2_ENCODING_REED_SOLOMON and it needs 4 active blocks,
        2 parity blocks, and 2
        spare blocks. The user wants to actively assimilate a regular
        file. As such, a layout might be as represented in <xref
        target='mixed_layout' />.  As this is an assimilation, most of
        the data reads will be satisfied by READ4 (see Section 18.22 of
        <xref target='RFC8881' format='default' sectionFormat='of' />)
        calls to index 0. However, as this is also an active file,
        there could also be READ_BLOCK4 (see <xref target='READ_BLOCK4' />)
        calls to the other indexes.
      </t>

      <figure anchor='mixed_layout'>
        <artwork>
         +---------------------------------------------------+
         | ffv2_layout4:                                     |
         +---------------------------------------------------+
         |     ffl_mirrors[0]:                               |
         |         ffm_data_servers:                         |
         |             ffv2_data_server4[0]                  |
         |                 ffds_flags: 0                     |
         |         ffm_encoding: FFV2_ENCODING_MIRRORED      |
         +---------------------------------------------------+
         |     ffl_mirrors[1]:                               |
         |         ffm_data_servers:                         |
         |             ffv2_data_server4[0]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
         |             ffv2_data_server4[1]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
         |             ffv2_data_server4[2]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
         |             ffv2_data_server4[3]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
         |             ffv2_data_server4[4]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_PARITY  |
         |             ffv2_data_server4[5]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_PARITY  |
         |             ffv2_data_server4[6]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_SPARE   |
         |             ffv2_data_server4[7]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_SPARE   |
         |         ffm_encoding: FFV2_ENCODING_REED_SOLOMON  |
         +---------------------------------------------------+
        </artwork>
      </figure>
      <t>
        When performing I/O via a FFV2_ENCODING_MIRRORED encoding
        type, the non-transformed data will be used, whereas with
        other encoding types, a metadata header and transformed block will
        be sent. Further, when reading data from the instance files,
        the client <bcp14>MUST</bcp14> be prepared to have one of the
        encoding types supply data and the other type not to supply
        data. I.e., the READ_BLOCK4 call might return rlr_eof set to true
        (see <xref target='code_READ_BLOCK4resok' />),
        which indicates that there is no data, whereas the READ4 call might
        return eof set to false, which indicates that there is data. The
        client <bcp14>MUST</bcp14> determine that there is in fact data.
      </t>
      <t>
        An example use case is the active assimilation of a file to
        ensure integrity. As the client is helping to translate the
        file to the new encoding scheme, it is actively modifying the
        file. As such, it might be sequentially reading the file in
        order to translate. The READ4 call would be returning data and
        the READ_BLOCK4 would not be returning data. As the client
        overwrites the file, the WRITE4 call and the WRITE_BLOCK4
        call would both have data sent. Finally, if the client
        read back a section which had been modified earlier, both
        the READ4 and READ_BLOCK4 calls would return data.
      </t>
    </section>
  </section>

  <section anchor='sec_erasure_encoding' numbered='true' removeInRFC='false' toc='default'>
    <name>Erasure Encoding</name>
    <t>
      Erasure Encoding takes a data block and transforms it into a payload to
      send to the data servers (see <xref target='encoding_transformation' />). It
      generates a metadata header and transformed block per data server. The header is metadata
      information for the transformed block. From now on, the metadata is
      simply referred to as the header and the transformed block as the
      block. The payload of a data block is the set of generated headers and blocks
      for that data block.
    </t>

    <t>
      The change_id is a unique identifier generated by the client to describe
      the current write transaction. The client_id is a unique identifier
      assigned by the metadata server to describe which client is making
      the current write transaction. The seq_id describes the index across the payload.
      The eff_len is the length of the data within the block. Finally, the crc32 is
      the 32-bit CRC calculated over the header (with the crc32
      field being 0) and the block. By combining the two
      parts of the payload, integrity is ensured for both
      parts.
    </t>

    <t>
      While the data block might have a length of 4kB, that does not
      necessarily mean that the length of the block
      is 4kB. That length is determined by the erasure encoding type
      algorithm. For example, Reed-Solomon might have 4kB
      blocks with the data integrity being provided by
      parity blocks. Another example would be the Mojette Transformation,
      which might have 1kB block lengths.
    </t>
    <t>
      The payload contains redundancy which will allow
      the erasure encoding type algorithm to repair
      blocks in the payload as it is transformed back to a data block (see
      <xref target='decoding_transformation' />).
      A payload is consistent when all of the contained headers
      share the same change_id and client_id. It has integrity
      when it is consistent and the blocks all pass the crc32 checks.
    </t>
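    <t>
      The two payload properties above can be sketched in Python as
      follows. This is a non-normative illustration: the Header class
      mirrors the header fields named in this section, while the
      function names and the crc_of callback are assumptions made for
      the example. In particular, how the header is serialized for the
      CRC calculation is left to the caller.
    </t>
    <figure>
      <sourcecode type='python'>
```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Header:
    # Field names follow the header fields described in this section.
    change_id: int
    client_id: int
    seq_id: int
    eff_len: int
    crc32: int


def is_consistent(headers: List[Header]) -> bool:
    """True when all headers share one (change_id, client_id) pair."""
    owners = {(h.change_id, h.client_id) for h in headers}
    return len(owners) == 1


def has_integrity(headers: List[Header],
                  blocks: List[bytes],
                  crc_of: Callable[[Header, bytes], int]) -> bool:
    """True when the payload is consistent and every block's
    recomputed CRC matches the crc32 stored in its header.

    crc_of(header, block) is supplied by the caller; this sketch
    does not define the CRC variant or the header serialization.
    """
    if not is_consistent(headers):
        return False
    return all(crc_of(h, b) == h.crc32
               for h, b in zip(headers, blocks))
```
      </sourcecode>
    </figure>
    <t>
      Keeping the consistency check separate from the CRC check lets a
      reader distinguish a payload with mixed owners from a payload
      whose blocks are corrupted.
    </t>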

    <section anchor='sec_encoding_transformation' numbered='true' removeInRFC='false' toc='default'>
      <name>Encoding a Data Block</name>

      <figure anchor='encoding_transformation'>
        <artwork>
                      +-----------------+
                      |  data block     |
                      +-----------------+
                      |                 |
                      | 3kB data        |
                      |                 |
                      +-----------------+
                      | 1kB empty       |
                      +-------+---------+
                              |
                              |
       +----------------------+-----------------------+
       |      Erasure Encoding (Transform Forward)    |
       +----+-------------------------------------+---+
            |                                     |
            |                                     |
        +---+----------------+         +----------+---------+
        | HEADER             |         | HEADER             |
        +--------------------+         +--------------------+
        | change_id: 3       |         | change_id: 3       |
        | client_id: 6       |         | client_id: 6       |
        | seq_id   : 0       |         | seq_id   : 5       |
        | eff_len  : 3kB     |  ...    | eff_len  : 3kB     |
        | crc32    :         |         | crc32    :         |
        +--------------------+         +--------------------+
        | BLOCK              |         | BLOCK              |
        +--------------------+         +--------------------+
        | data: ....         |         | data: ....         |
        +--------------------+         +--------------------+
             Data Server 1                 Data Server 6
        </artwork>
      </figure>

      <t>
        Each data block of the file resident in the client's cache of the
        file will be encoded into N different payloads to be
        sent to the data servers as shown in <xref target='encoding_transformation' />.
        As WRITE_BLOCK4 (see <xref target='WRITE_BLOCK4' />) can encode
        multiple write_block4 into a single transaction, a more accurate
        description of a WRITE_BLOCK4 might be as in <xref target='example_WRITE_BLOCK4_args_1' />.
      </t>

      <figure anchor='example_WRITE_BLOCK4_args_1'>
        <artwork>
        +------------------------------------+
        | WRITE_BLOCK4args                   |
        +------------------------------------+
        | wba_stateid: 0                     |
        | wba_offset: 1                      |
        | wba_stable: FILE_SYNC4             |
        | wba_seq_id: 0                      |
        | wba_owner:                         |
        |            bo_change_id: 3         |
        |            bo_client_id: 6         |
        | wba_block[0]:                      |
        |            wb_crc    :  0x32ef89   |
        |            wb_effective_len  : 4kB |
        |            wb_block  :  ......     |
        | wba_block[1]:                      |
        |            wb_crc    :  0x56fa89   |
        |            wb_effective_len  : 4kB |
        |            wb_block  :  ......     |
        | wba_block[2]:                      |
        |            wb_crc    :  0x7693af   |
        |            wb_effective_len  : 3kB |
        |            wb_block  :  ......     |
        +------------------------------------+
        </artwork>
      </figure>

      <t>
        <cref anchor='AI13' source='DF'>pay attention to the 128 bits alignment for wb_block_val</cref>
      </t>

      <t>
        This describes a 3-block write of data from an offset of 1 block in the file.
        As each block shares the wba_owner, it is only
        presented once. I.e., the data server will be able to construct the
        header for each wba_block from the wba_seq_id, wba_owner,
        wb_effective_len, and wb_crc.
      </t>

      <t>
        Assuming that there were no issues, <xref target='example_WRITE_BLOCK4_res_1' />
        illustrates the results. The payload sequence id is implicit in the WRITE_BLOCK4args.
      </t>

      <figure anchor='example_WRITE_BLOCK4_res_1'>
        <artwork>
        +-------------------------------+
        | WRITE_BLOCK4resok             |
        +-------------------------------+
        | wbr_count: 3                  |
        | wbr_committed: FILE_SYNC4     |
        | wbr_writeverf: 0xf1234abc     |
        | wbr_owners[0]:                |
        |            bo_block_id: 1     |
        |            bo_change_id: 3    |
        |            bo_client_id: 6    |
        |            bo_activated: true |
        | wbr_owners[1]:                |
        |            bo_block_id: 2     |
        |            bo_change_id: 3    |
        |            bo_client_id: 6    |
        |            bo_activated: true |
        | wbr_owners[2]:                |
        |            bo_block_id: 3     |
        |            bo_change_id: 3    |
        |            bo_client_id: 6    |
        |            bo_activated: true |
        +-------------------------------+
        </artwork>
      </figure>

      <section anchor='calculating_crc' numbered='true' removeInRFC='false' toc='exclude'>
        <name>Calculating the CRC32</name>
        <figure anchor='crc_before_calc'>
          <artwork>
        +--------------------+
        | HEADER             |
        +--------------------+
        | change_id: 7       |
        | client_id: 6       |
        | seq_id   : 0       |
        | eff_len  : 3kB     |
        | crc32    : 0       |
        +--------------------+
        | BLOCK              |
        +--------------------+
        | data:  ....        |
        +--------------------+
             Data Server 1
          </artwork>
        </figure>

        <t>
          Assuming the header and block as in <xref target='crc_before_calc' />,
          the crc32 needs to be calculated in order to fill in the wb_crc field. In this
          case, the crc32 is calculated over the 5 fields as shown in the
          header and the data of the block. In this example, it is calculated
          to be 0x21de8. The resulting WRITE_BLOCK4 is shown in <xref target='crc_after_calc' />.
        </t>

        <figure anchor='crc_after_calc'>
          <artwork>
        +------------------------------------+
        | WRITE_BLOCK4args                   |
        +------------------------------------+
        | wba_stateid: 0                     |
        | wba_offset: 1                      |
        | wba_stable: FILE_SYNC4             |
        | wba_seq_id: 0                      |
        | wba_owner:                         |
        |            bo_change_id: 7         |
        |            bo_client_id: 6         |
        | wba_block[0]:                      |
        |            wb_crc    :  0x21de8    |
        |            wb_effective_len  : 3kB |
        |            wb_block  :  ......     |
        +------------------------------------+
          </artwork>
        </figure>
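        <t>
          The calculation can be sketched in Python as follows. This is a
          non-normative illustration under assumptions that this document does
          not fix: the five header fields are serialized as big-endian
          fixed-width integers (64 bits each for change_id and client_id, 32
          bits each for seq_id, eff_len, and crc32), and the checksum is the
          CRC-32 provided by zlib.crc32. The name block_crc32 and the layout
          constant are hypothetical, and the value produced is not expected
          to match the illustrative 0x21de8.
        </t>
        <figure>
          <sourcecode type='python'>
import struct
import zlib

# Hypothetical header layout (an assumption, not defined by this document):
# change_id (64 bits), client_id (64 bits), seq_id (32 bits),
# eff_len (32 bits), crc32 (32 bits), all big-endian.
HEADER_FORMAT = "!QQIII"


def block_crc32(change_id, client_id, seq_id, eff_len, data):
    """CRC over the five header fields (crc32 held at 0) and the data."""
    header = struct.pack(HEADER_FORMAT, change_id, client_id, seq_id,
                         eff_len, 0)
    return zlib.crc32(header + data)


if __name__ == "__main__":
    payload = b"\x00" * 3072  # 3kB of block data
    print(hex(block_crc32(7, 6, 0, 3072, payload)))
          </sourcecode>
        </figure>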
      </section>
    </section>

    <section anchor='sec_decoding_transformation' numbered='true' removeInRFC='false' toc='default'>
      <name>Decoding a Data Block</name>
      <figure anchor='decoding_transformation'>
        <artwork>
             Data Server 1                 Data Server 6
        +--------------------+         +--------------------+
        | HEADER             |         | HEADER             |
        +--------------------+         +--------------------+
        | change_id: 1       |         | change_id: 1       |
        | client_id: 6       |         | client_id: 6       |
        | seq_id   : 0       |         | seq_id   : 5       |
        | eff_len  : 3kB     |  ...    | eff_len  : 3kB     |
        | crc32    :         |         | crc32    :         |
        +--------------------+         +--------------------+
        | BLOCK              |         | BLOCK              |
        +--------------------+         +--------------------+
        | data:  ....        |         | data:  ....        |
        +---+----------------+         +----------+---------+
            |                                     |
            |                                     |
       +----+-------------------------------------+---+
       |      Erasure Decoding (Transform Reverse)    |
       +----------------------+-----------------------+
                              |
                              |
                      +-------+---------+
                      |  data block     |
                      +-----------------+
                      |                 |
                      | 3kB data        |
                      |                 |
                      +-----------------+
                      | 1kB empty       |
                      +-----------------+
        </artwork>
      </figure>
      <t>
        When reading blocks via a READ_BLOCK4 operation, the client will decode
        the headers and payload into data blocks as shown in
        <xref target='decoding_transformation' />.  If the decoded data is
        shorter than a data block, i.e., the rb_effective_len
        is less than the data block size, then the inverse transformation
        <bcp14>MUST</bcp14> fill the remainder of the data block with 0s.
        The result must appear as a freshly written data block that was not
        completely filled.
      </t>
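      <t>
        The zero fill can be sketched in Python as follows. This is a
        non-normative illustration; the function name fill_data_block and
        its parameters are hypothetical, and the erasure decoding itself
        is assumed to have already produced the decoded bytes.
      </t>
      <figure>
        <sourcecode type='python'>
def fill_data_block(decoded, effective_len, block_size):
    """Return a full-size data block: decoded bytes, then zeros."""
    if effective_len > block_size:
        raise ValueError("effective length exceeds the data block size")
    kept = decoded[:effective_len]
    # Pad with zero bytes up to the full data block size.
    return kept + bytes(block_size - len(kept))


if __name__ == "__main__":
    block = fill_data_block(b"\x01" * 3072, 3072, 4096)
    print(len(block), block[3072:] == bytes(1024))
        </sourcecode>
      </figure>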

      <t>
        Note that at this point, the
        client can detect issues in the integrity of the data. The handling
        and repair of such issues are out of scope for this document and <bcp14>MUST</bcp14>
        be addressed in the document describing each erasure encoding type.
      </t>

      <section anchor='checking_crc' numbered='true' removeInRFC='false' toc='exclude'>
        <name>Checking the CRC32</name>

        <figure anchor='crc_on_wire'>
          <artwork>
        +------------------------------------+
        | READ_BLOCK4resok                   |
        +------------------------------------+
        | rbr_eof: false                     |
        | rbr_blocks[0]:                     |
        |            rb_crc: 0x21de8         |
        |            rb_effective_len  : 3kB |
        |            rb_owner:               |
        |                 bo_block_id: 1     |
        |                 bo_change_id: 7    |
        |                 bo_client_id: 6    |
        |                 bo_activated: true |
        |            rb_block  :  ......     |
        +------------------------------------+
          </artwork>
        </figure>

        <t>
          Assuming the READ_BLOCK4 results as in <xref target='crc_on_wire' />,
          the crc32 needs to be checked in order to ensure data integrity. Conceptually,
          a header and payload can be built as shown in <xref target='crc_checking' />.
          The crc32 is calculated over the 5 fields as shown in the
          header and the 3kB of block data. In this example, it is calculated
          to be 0x21de8, which matches the rb_crc. Thus the payload from this data server has data integrity.
        </t>

        <figure anchor='crc_checking'>
          <artwork>
        +---+----------------+
        | HEADER             |
        +--------------------+
        | change_id: 7       |
        | client_id: 6       |
        | seq_id   : 0       |
        | eff_len  : 3kB     |
        | crc32    : 0       |
        +--------------------+
        | BLOCK              |
        +--------------------+
        | data:  ....        |
        +--------------------+
             Data Server 1
          </artwork>
        </figure>
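        <t>
          The check can be sketched in Python as follows. This is a
          non-normative illustration under assumptions that this document does
          not fix: a hypothetical big-endian header layout (64 bits each for
          change_id and client_id, 32 bits each for seq_id, eff_len, and
          crc32) and the CRC-32 provided by zlib.crc32. The function name
          crc_matches is likewise hypothetical.
        </t>
        <figure>
          <sourcecode type='python'>
import struct
import zlib

# Hypothetical header layout; an assumption, not defined by this document.
HEADER_FORMAT = "!QQIII"


def crc_matches(change_id, client_id, seq_id, eff_len, data, received_crc):
    """Rebuild the header with crc32 = 0 and compare against rb_crc."""
    header = struct.pack(HEADER_FORMAT, change_id, client_id, seq_id,
                         eff_len, 0)
    # Only the effective length of the block data is covered.
    return zlib.crc32(header + data[:eff_len]) == received_crc
          </sourcecode>
        </figure>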

      </section>
    </section>
  </section>

  <section anchor='blocks_activate' numbered='true' removeInRFC='false' toc='default'>
    <name>Blocks and Activating</name>
    <t>
      Unlike the regular NFSv4.2 I/O operations, the base unit of I/O in this
      document is the block. The raw data stream is encoded/decoded into
      blocks as described in <xref target='sec_erasure_encoding' />.
      Each block is either activated or pending activation. This distinction is
      crucial in detecting write holes. A write hole occurs either when two
      different clients write to the same block concurrently or when a
      client overwrites existing data. In the first scenario, the order
      of writes is not deterministic and can result in a mixture of blocks
      in the payload. In the second scenario, network partitions or client
      restarts can result in partial writes. In both cases, the blocks have
      to be repaired, either by abandoning the new I/O or by sorting out
      the winner. Note that unlike the case of the encoding type detecting
      data integrity issues (see <xref target='sec_decoding_transformation'/>),
      the case of write holes is in the scope of this document.
    </t>
    <t>
      What is out of scope of this document is the manner in which the
      data servers implement the semantics of the new operations. I.e.,
      a data server might be able to leverage the native file system
      to achieve the semantics, or it might implement a
      multi-file approach to stage WRITE_BLOCK4 results and then
      shuffle blocks when the ACTIVATE_BLOCK4 or ROLLBACK_BLOCK4 operations
      activate or discard the data.
    </t>

    <section anchor='client_died' numbered='true' removeInRFC='false' toc='default'>
      <name>Dead or Partitioned Client</name>
      <t>
        Consider a client that crashes while in the middle of sending WRITE_BLOCK4
        operations to a set of data servers. Regardless of
        whether it comes back online or not, the metadata server can
        detect that the client had restarted when it had an outstanding
        LAYOUTIOMODE4_RW layout on the file. The metadata server can assign
        the file to a repair program, which would basically scan the entire
        file with READ_BLOCK_STATUS4. When it determines that it does not
        have enough payload blocks to rebuild the data block, it can
        determine that the I/O for that data block was not complete and
        throw away the blocks.
      </t>
      <t>
        Note that the repair process can throw away the blocks
        by using the ROLLBACK_BLOCK4 operation to unstage the pending written blocks.
      </t>
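      <t>
        The decision made by such a repair program can be sketched in Python
        as follows. This is a non-normative illustration: the per-data-server
        dictionary of block_owner4 values gathered via READ_BLOCK_STATUS4,
        the min_blocks threshold (the number of payload blocks the erasure
        encoding type needs in order to rebuild a data block), and the
        function name are all hypothetical.
      </t>
      <figure>
        <sourcecode type='python'>
from collections import Counter


def pending_to_roll_back(status_by_server, min_blocks):
    """Return pending (block_id, change_id, client_id) keys that are held
    by fewer than min_blocks data servers and so cannot be decoded."""
    held = Counter()
    for owners in status_by_server.values():
        # A set guards against counting one data server twice.
        pending = {(o["bo_block_id"], o["bo_change_id"], o["bo_client_id"])
                   for o in owners if not o["bo_activated"]}
        held.update(pending)
    return sorted(key for key, count in held.items() if min_blocks > count)
        </sourcecode>
      </figure>
      <t>
        Each key returned identifies a pending write that the repair program
        could unstage with ROLLBACK_BLOCK4.
      </t>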
    </section>

    <section anchor='client_overwrite' numbered='true' removeInRFC='false' toc='default'>
      <name>Client Overwrite</name>
      <t>
        Consider a client which gets back conflicting information in the WRITE_BLOCK4
        results.  Assume that the client has written to 6 data servers with WRITE_BLOCK4s
        as in <xref target='example_WRITE_BLOCK4_args_2' /> and gets back the
        results as in <xref target='example_WRITE_BLOCK4_res_2_a' />.
      </t>

      <figure anchor='example_WRITE_BLOCK4_args_2'>
        <artwork>
        +------------------------------------+
        | WRITE_BLOCK4args                   |
        +------------------------------------+
        | wba_stateid: 0                     |
        | wba_offset: 1                      |
        | wba_stable: FILE_SYNC4             |
        | wba_seq_id: 0                      |
        | wba_owner:                         |
        |            bo_change_id: 3         |
        |            bo_client_id: 6         |
        | wba_block[0]:                      |
        |            wb_crc    :  0x32ef89   |
        |            wb_effective_len  : 4kB |
        |            wb_block  :  ......     |
        | wba_block[1]:                      |
        |            wb_crc    :  0x56fa89   |
        |            wb_effective_len  : 4kB |
        |            wb_block  :  ......     |
        +------------------------------------+
        </artwork>
      </figure>

      <t>
        <xref target='example_WRITE_BLOCK4_res_2_a' /> shows that the
        first block was an overwrite and an activation has to be done in order
        for the newly written block to be returned in a READ_BLOCK4. Assume
        that the next four data servers had the same type of response.
      </t>

      <figure anchor='example_WRITE_BLOCK4_res_2_a'>
        <artwork>
                Data Server 1
        +--------------------------------+
        | WRITE_BLOCK4resok              |
        +--------------------------------+
        | wbr_count: 2                   |
        | wbr_committed: FILE_SYNC4      |
        | wbr_writeverf: 0xf1234abc      |
        | wbr_owners[0]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 2     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        | wbr_owners[1]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: false |
        | wbr_owners[2]:                 |
        |            bo_block_id: 2      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        +--------------------------------+
        </artwork>
      </figure>

      <t>
        But assume that data server 4 does not respond to the WRITE_BLOCK4
        operation. While the client can detect this and send the WRITE_BLOCK4
        to any data server marked as FFV2_DS_FLAGS_SPARE, it might decide
        to see if the data server did in fact do the transaction. It might
        also be the case that there are no data servers marked as
        FFV2_DS_FLAGS_SPARE.  The client issues a READ_BLOCK_STATUS4
        (see <xref target='example_READ_BLOCK_STATUS4_args_1' />)
        and gets the results in <xref target='example_READ_BLOCK_STATUS4_res_2_b' />.
        This indicates that data server 4 did not get the WRITE_BLOCK4
        request.
      </t>

      <t>
        In general, the client can resend the WRITE_BLOCK4 request,
        determine by the erasure encoding type that there are sufficient
        payload blocks present to decode the data block, or ROLLBACK_BLOCK4
        the existing blocks to back out the change.
      </t>

      <figure anchor='example_READ_BLOCK_STATUS4_args_1'>
        <artwork>
                Data Server 4
        +--------------------------------+
        | READ_BLOCK_STATUS4args         |
        +--------------------------------+
        | rbsa_stateid: 0                |
        | rbsa_offset: 1                 |
        | rbsa_count: 3                  |
        +----------+---------------------+
        </artwork>
      </figure>

      <figure anchor='example_READ_BLOCK_STATUS4_res_2_b'>
        <artwork>
                Data Server 4
        +--------------------------------+
        | READ_BLOCK_STATUS4resok        |
        +--------------------------------+
        | rbsr_eof: true                 |
        | rbsr_blocks[0]:                |
        |            bo_block_id: 1      |
        |            bo_change_id: 2     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        +--------------------------------+
        </artwork>
      </figure>
    </section>

    <section anchor='racing_clients' numbered='true' removeInRFC='false' toc='default'>
      <name>Racing Clients</name>
      <t>
        Assume that the client has written to 6 data servers with WRITE_BLOCK4s
        as in <xref target='example_WRITE_BLOCK4_args_2' />. But now it gets back
        the conflicting results in <xref target='example_WRITE_BLOCK4_res_3_a' />
        and <xref target='example_WRITE_BLOCK4_res_3_b' />. From this, it can
        detect that there was a race with another client. Note, even though
        both clients present the same bo_change_id, nothing can be inferred
        as to the ordering of the two transactions. In some cases, bo_client_id 10
        won the race and in some cases, bo_client_id 6 won the race.
      </t>

      <t>
        As a subsequent READ_BLOCK4 will produce garbage, the clients need
        to agree on how to fix this issue without any communication. A simplistic
        approach is for each client to retry the WRITE_BLOCK4 until such time
        as the payload is consistent. Note that this does not mean that both
        clients win; it means only that one of them wins.
      </t>

      <t>
        Another option is for the clients to report a LAYOUTERROR4
        (see Section 15.6 of <xref target='RFC7862' format='default'
        sectionFormat='of' />) to the metadata server with an error of
        NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT. That would then
        allow the metadata server to assign the repairing of the
        file.
      </t>

      <figure anchor='example_WRITE_BLOCK4_res_3_a'>
        <artwork>
                Data Server 1
        +--------------------------------+
        | WRITE_BLOCK4resok              |
        +--------------------------------+
        | wbr_count: 2                   |
        | wbr_committed: FILE_SYNC4      |
        | wbr_writeverf: 0xf1234abc      |
        | wbr_owners[0]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 10    |
        |            bo_activated: true  |
        | wbr_owners[1]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: false |
        | wbr_owners[2]:                 |
        |            bo_block_id: 2      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        +--------------------------------+
        </artwork>
      </figure>

      <figure anchor='example_WRITE_BLOCK4_res_3_b'>
        <artwork>
                Data Server 2
        +--------------------------------+
        | WRITE_BLOCK4resok              |
        +--------------------------------+
        | wbr_count: 2                   |
        | wbr_committed: FILE_SYNC4      |
        | wbr_writeverf: 0xf1234abc      |
        | wbr_owners[0]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        | wbr_owners[1]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 10    |
        |            bo_activated: false |
        | wbr_owners[2]:                 |
        |            bo_block_id: 2      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        +--------------------------------+
        </artwork>
      </figure>
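      <t>
        Detecting the race from the WRITE_BLOCK4 results can be sketched in
        Python as follows. This is a non-normative illustration: the
        dictionary of wbr_owners keyed by data server and the function name
        racing_blocks are hypothetical. A block is flagged when different
        data servers report different activated owners for it.
      </t>
      <figure>
        <sourcecode type='python'>
def racing_blocks(owners_by_server):
    """Map each bo_block_id to the set of (bo_client_id, bo_change_id)
    pairs seen as activated, keeping only blocks with more than one."""
    activated = {}
    for owners in owners_by_server.values():
        for o in owners:
            if o["bo_activated"]:
                pair = (o["bo_client_id"], o["bo_change_id"])
                activated.setdefault(o["bo_block_id"], set()).add(pair)
    return {bid: pairs for bid, pairs in activated.items()
            if len(pairs) > 1}
        </sourcecode>
      </figure>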

      <section anchor='multiple_writers' numbered='true' removeInRFC='false' toc='exclude'>
        <name>Multiple Writers</name>
        <t>
          Note that nothing prevents pending blocks from accumulating or more than
          2 writers from trying to write the same payload. An example of such a WRITE_BLOCK4resok
          in response to the example of <xref target='example_WRITE_BLOCK4_args_2' /> is shown
          in <xref target='example_WRITE_BLOCK4_res_3_c' />. Not only has client 6 tried to
          update block 1, but clients 7 and 10 are also attempting to update it.
        </t>
        <figure anchor='example_WRITE_BLOCK4_res_3_c'>
          <artwork>
                Data Server 2
        +--------------------------------+
        | WRITE_BLOCK4resok              |
        +--------------------------------+
        | wbr_count: 2                   |
        | wbr_committed: FILE_SYNC4      |
        | wbr_writeverf: 0xf1234abc      |
        | wbr_owners[0]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        | wbr_owners[1]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 4     |
        |            bo_client_id: 6     |
        |            bo_activated: false |
        | wbr_owners[2]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 20    |
        |            bo_client_id: 7     |
        |            bo_activated: false |
        | wbr_owners[3]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 10    |
        |            bo_activated: false |
        | wbr_owners[4]:                 |
        |            bo_block_id: 2      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        +--------------------------------+
          </artwork>
        </figure>
      </section>
    </section>

    <section anchor='reader_writer' numbered='true' removeInRFC='false' toc='default'>
      <name>Reader and Writer Racing</name>
      <t>
        In addition to the above write hole scenarios, a further complication
        is a racing reader and writer. If the client reads a block and determines
        that the payload is not consistent (i.e., not all of the payload blocks
        share the same client_id and change_id), then it can assume that it
        has encountered a race with another client writing to the file. It
        <bcp14>SHOULD</bcp14> retry the READ_BLOCK4 operation until payload
        consistency is achieved. It may decide to send a LAYOUTERROR4
        to the metadata server with an error of NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT.

        <cref anchor='AI24' source='TH'>
          And should it hang forever? Perhaps
          a new layout error that the client can send the MDS?
          Or should it probe with READ_BLOCK_STATUS4 to try to repair?
        </cref>
        <cref anchor='AI25' source='TH'>
          Perhaps a LAYOUTERROR_BLOCK4 to send an encoding type specific location?
        </cref>
      </t>
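      <t>
        The consistency check and retry can be sketched in Python as follows.
        This is a non-normative illustration: the read_payload callable
        (standing in for issuing READ_BLOCK4 to each data server), the
        max_attempts bound, and both function names are hypothetical. This
        document does not define a retry limit; the bound here is only an
        assumption so that the sketch terminates.
      </t>
      <figure>
        <sourcecode type='python'>
def payload_is_consistent(blocks):
    """True when every payload block shares one (client_id, change_id)."""
    owners = {(b["bo_client_id"], b["bo_change_id"]) for b in blocks}
    return len(owners) == 1


def read_consistent_payload(read_payload, max_attempts=3):
    """Call read_payload() until consistent; None means give up and let
    the caller report NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT."""
    for _ in range(max_attempts):
        blocks = read_payload()
        if payload_is_consistent(blocks):
            return blocks
    return None
        </sourcecode>
      </figure>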
    </section>
  </section>

  <section anchor='supporting' numbered='true' removeInRFC='false' toc='default'>
    <name>New Infrastructure</name>

    <section anchor='errors' numbered='true' removeInRFC='false' toc='default'>
      <name>Errors</name>
      <section anchor='NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT' numbered='true' removeInRFC='false' toc='exclude'>
        <name>Error 10097 - NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT</name>
        <t>
          The client encountered a payload in which the blocks were inconsistent and
          remained inconsistent. As the client cannot tell if another client is
          actively writing, it informs the metadata server of this error via
          LAYOUTERROR4.  The metadata server can then arrange for
          repair of the file.
        </t>
        <t>
          Note that due to the opaqueness of the clientid4, the client cannot
          differentiate between boot instances of the metadata server or client, but the
          metadata server can do that differentiation. I.e., it can tell if the
          inconsistency is from the same client, whether that client is active
          and actively writing to the file (i.e., does the client have the file
          open and with a LAYOUTIOMODE4_RW layout?).
        </t>
      </section>
      <section anchor='NFS4ERR_ERASURE_ENCODING_NOT_SUPPORTED' numbered='true' removeInRFC='false' toc='exclude'>
        <name>Error 10098 - NFS4ERR_ERASURE_ENCODING_NOT_SUPPORTED</name>
        <t>
          The client requested an ffv2_encoding_type which the metadata server does not support. I.e.,
          if the client sends a layout_hint requesting an erasure encoding type that the
          metadata server does not support, this error code can be returned. The client
          might have to send the layout_hint several times to determine the overlapping
          set of supported erasure encoding types.
        </t>
      </section>
      <section anchor='NFS4ERR_ERASURE_ENCODING_BLOCK_MISMATCH' numbered='true' removeInRFC='false' toc='exclude'>
        <name>Error 10099 - NFS4ERR_ERASURE_ENCODING_BLOCK_MISMATCH</name>
        <t>
          The client asked the data server to update only the header,
          and the data server cannot find a matching block at that offset.
        </t>
      </section>
    </section>

    <section anchor='EXCHGID4_FLAG_USE_ERASURE_DS' numbered='true' removeInRFC='false' toc='default'>
      <name>EXCHGID4_FLAG_USE_ERASURE_DS</name>
      <figure anchor='code_EXCHGID4_FLAG_USE_ERASURE_DS'>
        <sourcecode type='xdr'>
/// const EXCHGID4_FLAG_USE_ERASURE_DS      = 0x00100000;
        </sourcecode>
      </figure>

      <t>
        When a data server connects to a metadata server, it
        can state its pNFS role via EXCHANGE_ID (see Section 18.35 of <xref target='RFC8881'
        format='default' sectionFormat='of' />).
        The data server can use EXCHGID4_FLAG_USE_ERASURE_DS
        (see <xref target='code_EXCHGID4_FLAG_USE_ERASURE_DS' />)
        to indicate that it supports the new NFSv4.2 operations
        introduced in this document.  Section 13.1 of <xref target='RFC8881'
        format='default' sectionFormat='of' /> describes the
        interaction of the various pNFS roles masked by EXCHGID4_FLAG_MASK_PNFS.
        However, that does not mask out EXCHGID4_FLAG_USE_ERASURE_DS.
        I.e., EXCHGID4_FLAG_USE_ERASURE_DS can be used in combination
        with all of the pNFS flags.
      </t>
      <t>
        If the data server sets EXCHGID4_FLAG_USE_ERASURE_DS during the
        EXCHANGE_ID operation, then it <bcp14>MUST</bcp14> support:
        ACTIVATE_BLOCK4, READ_BLOCK_STATUS4, READ_BLOCK4, ROLLBACK_BLOCK4,
        and WRITE_BLOCK4. Further, note that this support is
        orthogonal to the Erasure Encoding Type selected. The
        data server is unaware of which type is driving the I/O.
        It is also unaware of the payload layout or what type
        of block it is serving.
      </t>
    </section>

    <section anchor='block_owner' numbered='true' removeInRFC='false' toc='default'>
      <name>Block Owner</name>

      <figure anchor='code_block_owner4'>
        <sourcecode type='xdr'>
/// struct block_owner4 {
///     uint32_t    bo_block_id;
///     changeid4   bo_change_id;
///     clientid4   bo_client_id;
///     bool        bo_activated;
/// };
        </sourcecode>
      </figure>

      <t>
        The block_owner4 (see <xref target='code_block_owner4' />)
        is used to determine when and by whom a block was written.
        The bo_block_id is used to identify the block and <bcp14>MUST</bcp14>
        be the index of the block within the file. I.e., it is the
        offset of the start of the block divided by the block length.
        The bo_client_id <bcp14>MUST</bcp14> be the client id handed out
        by the metadata server to the client as the eir_clientid during
        the EXCHANGE_ID results (see Section 18.35 of <xref target='RFC8881'
        format='default' sectionFormat='of' />)  and <bcp14>MUST NOT</bcp14>
        be the client id supplied by the data server to the client. I.e.,
        across all data files, the bo_client_id uniquely describes one and
        only one client.
      </t>
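      <t>
        The relationship between a byte offset and bo_block_id can be
        sketched in Python as follows. This is a non-normative illustration
        that assumes a single fixed block length for the data file; the
        function name and parameters are hypothetical.
      </t>
      <figure>
        <sourcecode type='python'>
def block_id_for_offset(byte_offset, block_len):
    """Index of the block whose start is at byte_offset."""
    if byte_offset % block_len != 0:
        raise ValueError("byte_offset is not at the start of a block")
    return byte_offset // block_len


if __name__ == "__main__":
    print(block_id_for_offset(8192, 4096))
        </sourcecode>
      </figure>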
      <t>
        The bo_change_id is like the change attribute
        (see Section 5.8.1.4 of <xref target='RFC8881' format='default'
        sectionFormat='of' />) in that each block write by a given
        client has to have a unique bo_change_id. I.e., it can
        be determined to which transaction, across all data files,
        a block corresponds.
      </t>
      <t>
        The bo_activated is used by the data server to indicate whether
        the block I/O was activated or pending activation. The first WRITE_BLOCK4 to
        a location is automatically activated if the WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY
        is set. Subsequent WRITE_BLOCK4 modifications
        to that block location are not automatically activated. The
        client has to ACTIVATE_BLOCK4 the block in order to get it activated.
      </t>
      <t>
        The concept of automatically activating is dependent on the
        wba_stable field of the WRITE_BLOCK4args.
      </t>
    </section>
  </section>

  <section anchor='ops' numbered='true' removeInRFC='false' toc='default'>
    <name>New NFSv4.2 Operations</name>
    <section anchor='ACTIVATE_BLOCK4' numbered='true' removeInRFC='false' toc='default'>
      <name>Operation 77: ACTIVATE_BLOCK4 - Activate Cached Block Data</name>
      <section anchor='ACTIVATE_BLOCK4_args' numbered='true' removeInRFC='false' toc='exclude'>
        <name>ARGUMENTS</name>
        <figure anchor='code_ACTIVATE_BLOCK4args'>
          <sourcecode type='xdr'>
/// struct ACTIVATE_BLOCK4args {
///     /* CURRENT_FH: file */
///     offset4         aba_offset;
///     count4          aba_count;
///     block_owner4    aba_blocks&lt;&gt;;
/// };
          </sourcecode>
        </figure>
      </section>

      <section anchor='ACTIVATE_BLOCK4_res' numbered='true' removeInRFC='false' toc='exclude'>
        <name>RESULTS</name>
        <figure anchor='code_ACTIVATE_BLOCK4resok'>
          <sourcecode type='xdr'>
/// struct ACTIVATE_BLOCK4resok {
///     verifier4       abr_writeverf;
/// };
          </sourcecode>
        </figure>

        <figure anchor='code_ACTIVATE_BLOCK4res'>
          <sourcecode type='xdr'>
/// union ACTIVATE_BLOCK4res switch (nfsstat4 abr_status) {
///     case NFS4_OK:
///         ACTIVATE_BLOCK4resok   abr_resok4;
///     default:
///         void;
/// };
          </sourcecode>
        </figure>
      </section>

      <section anchor='ACTIVATE_BLOCK4_desc' numbered='true' removeInRFC='false' toc='exclude'>
        <name>DESCRIPTION</name>
        <t>
          ACTIVATE_BLOCK4 is COMMIT4 (see Section 18.3 of <xref target='RFC8881'
          format='default' sectionFormat='of' />) with additional semantics
          over the block_owner activating the blocks. As such, all of
          the normal semantics of COMMIT4 directly apply.
        </t>
        <t>
          The main difference between the two operations is that ACTIVATE_BLOCK4
          works on blocks and not a raw data stream. As such aba_offset
          is the starting block offset in the file and not the byte
          offset in the file. Some erasure encoding types can have
          different block sizes depending on the
          block type. Further, aba_count is a count of blocks to activate
          and not bytes to activate.
        </t>
        <t>
          Further, while it may appear that the combination of aba_offset
          and aba_count is redundant with aba_blocks, the purpose of
          aba_blocks is to allow the data server to differentiate between
          potentially multiple pending blocks.
        </t>
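          <t>
            How a block owner selects one of several pending blocks can be
            sketched in Python as follows. This is a non-normative,
            in-memory illustration only; how a data server actually
            implements these semantics is out of scope for this document.
            The BlockSlot class, its methods, and the choice to leave other
            pending writes in place after an activation are all
            assumptions made for the sketch.
          </t>
          <figure>
            <sourcecode type='python'>
class BlockSlot:
    """One block location: the activated block plus pending writes."""

    def __init__(self):
        self.active = None   # (change_id, client_id, data) or None
        self.pending = []    # list of (change_id, client_id, data)

    def write(self, change_id, client_id, data, activate_if_empty=False):
        entry = (change_id, client_id, data)
        if self.active is None and activate_if_empty:
            self.active = entry      # first write to an empty location
        else:
            self.pending.append(entry)

    def activate(self, change_id, client_id):
        """Promote the pending write named by the block owner."""
        for entry in self.pending:
            if entry[:2] == (change_id, client_id):
                self.pending.remove(entry)
                self.active = entry
                return True
        return False

    def read(self):
        """Only the activated data is visible to a read."""
        return None if self.active is None else self.active[2]
            </sourcecode>
          </figure>
          <t>
            In this sketch, two clients can each leave a pending write at
            the same location, and the (change_id, client_id) pair supplied
            at activation determines which one becomes visible to
            subsequent reads.
          </t>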
      </section>
    </section>

    <section anchor='READ_BLOCK_STATUS4' numbered='true' removeInRFC='false' toc='default'>
      <name>Operation 78: READ_BLOCK_STATUS4 - Read Block Commit Status from File</name>
      <section anchor='READ_BLOCK_STATUS4_args' numbered='true' removeInRFC='false' toc='exclude'>
        <name>ARGUMENTS</name>
        <figure anchor='code_READ_BLOCK_STATUS4args'>
          <sourcecode type='xdr'>
/// struct READ_BLOCK_STATUS4args {
///     /* CURRENT_FH: file */
///     stateid4    rbsa_stateid;
///     offset4     rbsa_offset;
///     count4      rbsa_count;
/// };
          </sourcecode>
        </figure>
      </section>

      <section anchor='READ_BLOCK_STATUS4_res' numbered='true' removeInRFC='false' toc='exclude'>
        <name>RESULTS</name>
        <figure anchor='code_READ_BLOCK_STATUS4resok'>
          <sourcecode type='xdr'>
/// struct READ_BLOCK_STATUS4resok {
///     bool            rbsr_eof;
///     block_owner4    rbsr_blocks&lt;&gt;;
/// };
          </sourcecode>
        </figure>

        <figure anchor='code_READ_BLOCK_STATUS4res'>
          <sourcecode type='xdr'>
/// union READ_BLOCK_STATUS4res switch (nfsstat4 rbsr_status) {
///     case NFS4_OK:
///         READ_BLOCK_STATUS4resok     rbsr_resok4;
///     default:
///         void;
/// };
          </sourcecode>
        </figure>
      </section>

      <section anchor='READ_BLOCK_STATUS4_desc' numbered='true' removeInRFC='false' toc='exclude'>
        <name>DESCRIPTION</name>
        <t>
          READ_BLOCK_STATUS4 differs from READ_BLOCK4 in that it only reads
          active and pending headers in the desired data range.
        </t>
      </section>
    </section>

    <section anchor='READ_BLOCK4' numbered='true' removeInRFC='false' toc='default'>
      <name>Operation 79: READ_BLOCK4 - Read Blocks from File</name>
      <section anchor='READ_BLOCK4_args' numbered='true' removeInRFC='false' toc='exclude'>
        <name>ARGUMENTS</name>
        <figure anchor='code_READ_BLOCK4args'>
          <sourcecode type='xdr'>
/// struct READ_BLOCK4args {
///     /* CURRENT_FH: file */
///     stateid4    rba_stateid;
///     offset4     rba_offset;
///     count4      rba_count;
/// };
          </sourcecode>
        </figure>
      </section>

      <section anchor='READ_BLOCK4_res' numbered='true' removeInRFC='false' toc='exclude'>
        <name>RESULTS</name>

        <figure anchor='code_read_block4'>
          <sourcecode type='xdr'>
/// struct read_block4 {
///     uint32_t        rb_crc;
///     uint32_t        rb_effective_len;
///     block_owner4    rb_owner;
///     uint32_t        rb_seq_id;
///     opaque          rb_block&lt;&gt;;
/// };
          </sourcecode>
        </figure>

        <figure anchor='code_READ_BLOCK4resok'>
          <sourcecode type='xdr'>
/// struct READ_BLOCK4resok {
///     bool        rbr_eof;
///     read_block4 rbr_blocks&lt;&gt;;
/// };
          </sourcecode>
        </figure>

        <figure anchor='code_READ_BLOCK4res'>
          <sourcecode type='xdr'>
/// union READ_BLOCK4res switch (nfsstat4 rbr_status) {
///     case NFS4_OK:
///          READ_BLOCK4resok     rbr_resok4;
///     default:
///          void;
/// };
          </sourcecode>
        </figure>
      </section>

      <section anchor='READ_BLOCK_desc' numbered='true' removeInRFC='false' toc='exclude'>
        <name>DESCRIPTION</name>
        <t>
          READ_BLOCK4 is READ4 (see Section 18.22 of <xref target='RFC8881'
          format='default' sectionFormat='of' />) with additional semantics
          over the block_owner and the activation of blocks. As such, all of
          the normal semantics of READ4 directly apply.
        </t>
        <t>
          The main difference between the two operations is that READ_BLOCK4
          works on blocks and not on a raw data stream. As such, rba_offset
          is the starting block offset in the file and not the byte
          offset in the file. Some erasure encoding types can have
          different block sizes depending on the
          block type. Further, rba_count is a count of blocks to read
          and not bytes to read.
        </t>
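        <t>
          As a non-normative illustration of the block addressing described
          above, the following Python sketch converts a byte range into
          candidate rba_offset and rba_count values. It assumes a single
          fixed block size (BLOCK_SIZE, a hypothetical 4kB) and ignores
          striping across data servers; as noted above, a fixed size does not
          hold for every erasure encoding type. The helper name
          byte_range_to_blocks is illustrative and is not part of the protocol.
        </t>
        <sourcecode type='python'>
BLOCK_SIZE = 4096  # assumption: one fixed block size for all blocks


def byte_range_to_blocks(byte_offset, byte_count, block_size=BLOCK_SIZE):
    """Return (rba_offset, rba_count) covering the given byte range."""
    first_block = byte_offset // block_size
    if byte_count == 0:
        return (first_block, 0)
    last_block = (byte_offset + byte_count - 1) // block_size
    return (first_block, last_block - first_block + 1)
        </sourcecode>
        <t>
          Under these assumptions, a 16kB read starting at byte 8192 maps to
          an rba_offset of 2 and an rba_count of 4.
        </t>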
        <t>
          READ_BLOCK4 also returns only the activated block at each location.
          That is, if a client overwrites the block at offset 10 and then
          reads that block without first activating it, the original
          block is returned.
        </t>
        <t>
          When reading a set of blocks across the data servers, it can be
          the case that some data servers do not have any data at that
          location. In that case, the data server either returns rbr_eof
          if the rba_offset exceeds the number of blocks of which
          the data server is aware, or it returns an empty block
          for that location.
        </t>

        <t>
          For example, in <xref target='example_READ_BLOCK4_args_1' />, the
          client asks for 4 blocks starting with the third block in the file
          (block offset 2).
          The second data server responds as in <xref target='example_READ_BLOCK_res_1' />.
          The client would interpret this as follows: there is valid data for
          blocks 2 and 4, there is a hole at block 3, and there is no data
          for block 5.
          Note that the data server <bcp14>MUST</bcp14> calculate a valid
          rb_crc for block 3 based on the generated fields.

        <figure anchor='example_READ_BLOCK4_args_1'>
          <artwork>
                Data Server 2
        +--------------------------------+
        | READ_BLOCK4args                |
        +--------------------------------+
        | rba_stateid: 0                 |
        | rba_offset: 2                  |
        | rba_count: 4                   |
        +--------------------------------+
          </artwork>
        </figure>

      <figure anchor='example_READ_BLOCK_res_1'>
        <artwork>
                Data Server 2
        +--------------------------------+
        | READ_BLOCK4resok               |
        +--------------------------------+
        | rbr_eof: true                  |
        | rbr_blocks[0]:                 |
        |     rb_crc: 0x3faddace         |
        |     rb_effective_len: 4kB      |
        |     rb_owner:                  |
        |            bo_block_id: 2      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        |     rb_seq_id: 1               |
        |     rb_block: ....             |
        | rbr_blocks[1]:                 |
        |     rb_crc: 0xdeade4e5         |
        |     rb_effective_len: 4kB      |
        |     rb_owner:                  |
        |            bo_block_id: 3      |
        |            bo_change_id: 0     |
        |            bo_client_id: 0     |
        |            bo_activated: false |
        |     rb_seq_id: 1               |
        |     rb_block: 0000...00000     |
        | rbr_blocks[2]:                 |
        |     rb_crc: 0x7778abcd         |
        |     rb_effective_len: 2kB      |
        |     rb_owner:                  |
        |            bo_block_id: 4      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        |     rb_seq_id: 1               |
        |     rb_block: ....             |
        +--------------------------------+
        </artwork>
      </figure>
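        <t>
          The client-side reading of the example response can be sketched
          as follows. This is a non-normative Python illustration: blocks
          are modeled as plain dictionaries, the function name
          classify_blocks is hypothetical, and the sketch assumes that a
          generated (empty) block can be recognized by bo_activated being
          false, as in the example above. Only the fields needed for the
          classification are shown.
        </t>
        <sourcecode type='python'>
def classify_blocks(rba_offset, rba_count, returned_blocks):
    """Map each requested block id to 'data', 'hole', or 'absent'."""
    status = {}
    for block_id in range(rba_offset, rba_offset + rba_count):
        status[block_id] = "absent"  # nothing returned for it (yet)
    for block in returned_blocks:
        owner = block["rb_owner"]
        # assumption: a generated block has bo_activated set to false
        kind = "data" if owner["bo_activated"] else "hole"
        status[owner["bo_block_id"]] = kind
    return status


example_blocks = [
    {"rb_owner": {"bo_block_id": 2, "bo_activated": True}},
    {"rb_owner": {"bo_block_id": 3, "bo_activated": False}},
    {"rb_owner": {"bo_block_id": 4, "bo_activated": True}},
]
        </sourcecode>
        <t>
          Calling classify_blocks(2, 4, example_blocks) marks blocks 2 and 4
          as data, block 3 as a hole, and block 5 as absent.
        </t>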

      </section>
    </section>

    <section anchor='ROLLBACK_BLOCK' numbered='true' removeInRFC='false' toc='default'>
      <name>Operation 80: ROLLBACK_BLOCK4 - Rollback Cached Block Data</name>
      <section anchor='ROLLBACK_BLOCK4_args' numbered='true' removeInRFC='false' toc='exclude'>
        <name>ARGUMENTS</name>
        <figure anchor='code_ROLLBACK_BLOCK4args'>
          <sourcecode type='xdr'>
/// struct ROLLBACK_BLOCK4args {
///     /* CURRENT_FH: file */
///     offset4         rba_offset;
///     count4          rba_count;
///     block_owner4    rba_blocks&lt;&gt;;
/// };
          </sourcecode>
        </figure>
      </section>

      <section anchor='ROLLBACK_BLOCK_res' numbered='true' removeInRFC='false' toc='exclude'>
        <name>RESULTS</name>
        <figure anchor='code_ROLLBACK_BLOCK4resok'>
          <sourcecode type='xdr'>
/// struct ROLLBACK_BLOCK4resok {
///     verifier4       rbr_writeverf;
/// };
          </sourcecode>
        </figure>

        <figure anchor='code_ROLLBACK_BLOCK4res'>
          <sourcecode type='xdr'>
/// union ROLLBACK_BLOCK4res switch (nfsstat4 rbr_status) {
///     case NFS4_OK:
///         ROLLBACK_BLOCK4resok   rbr_resok4;
///     default:
///         void;
/// };
          </sourcecode>
        </figure>
      </section>

      <section anchor='ROLLBACK_BLOCK4_desc' numbered='true' removeInRFC='false' toc='exclude'>
        <name>DESCRIPTION</name>
        <t>
          ROLLBACK_BLOCK4 is a new operation modeled on COMMIT4 (see
          Section 18.3 of <xref target='RFC8881'
          format='default' sectionFormat='of' />) with additional semantics
          over the block_owner and the rolling back of written blocks. As
          such, all of the normal semantics of COMMIT4 directly apply.
        </t>

        <t>
          The main difference between the two operations is that ROLLBACK_BLOCK4
          works on blocks and not on a raw data stream. As such, rba_offset
          is the starting block offset in the file and not the byte
          offset in the file. Some erasure encoding types can have
          different block sizes depending on the
          block type. Further, rba_count is a count of blocks to roll back
          and not bytes to roll back.
        </t>
        <t>
          Further, while it may appear that the combination of rba_offset
          and rba_count is redundant with rba_blocks, the purpose of
          rba_blocks is to allow the data server to differentiate between
          potentially multiple pending blocks.
        </t>

        <t>
          ROLLBACK_BLOCK4 deletes prior WRITE_BLOCK4 transactions. In the case of
          write holes, it allows the client to undo transactions to repair the file.
        </t>
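        <t>
          The role of rba_blocks can be illustrated with a short,
          non-normative Python sketch. It assumes a hypothetical data server
          bookkeeping structure, pending, that maps a block id to a list of
          (change_id, client_id) pairs for writes that have not been
          activated; neither that structure nor the function name
          rollback_pending is defined by this document.
        </t>
        <sourcecode type='python'>
def rollback_pending(pending, rba_offset, rba_count, rba_blocks):
    """Drop the pending writes named by rba_blocks; return the count."""
    dropped = 0
    wanted = range(rba_offset, rba_offset + rba_count)
    for owner in rba_blocks:
        block_id = owner["bo_block_id"]
        if block_id not in wanted:
            continue  # outside the requested block range
        key = (owner["bo_change_id"], owner["bo_client_id"])
        writes = pending.get(block_id, [])
        if key in writes:
            writes.remove(key)  # only this owner's pending write goes
            dropped += 1
    return dropped
        </sourcecode>
        <t>
          With two pending writes on the same block, only the one whose
          change_id and client_id match an entry in rba_blocks is removed,
          which is the differentiation that rba_offset and rba_count alone
          cannot express.
        </t>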
      </section>
    </section>

    <section anchor='WRITE_BLOCK4' numbered='true' removeInRFC='false' toc='default'>
      <name>Operation 81: WRITE_BLOCK4 - Write Blocks to File</name>
      <section anchor='WRITE_BLOCK4_args' numbered='true' removeInRFC='false' toc='exclude'>
        <name>ARGUMENTS</name>

        <figure anchor='code_wb_args_flags'>
          <sourcecode type='xdr'>
/// const WRITE_BLOCK_FLAGS_UPDATE_HEADER_ONLY   = 0x00000001;
/// const WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY    = 0x00000002;
          </sourcecode>
        </figure>

        <figure anchor='code_write_block4'>
          <sourcecode type='xdr'>
/// struct write_block4 {
///     uint32_t        wb_crc;
///     uint32_t        wb_effective_len;
///     uint32_t        wb_flags;
///     opaque          wb_block&lt;&gt;;
/// };
          </sourcecode>
        </figure>

        <figure anchor='code_guard_block_owner4'>
          <sourcecode type='xdr'>
/// struct guard_block_owner4 {
///     changeid4   gbo_change_id;
///     clientid4   gbo_client_id;
/// };
          </sourcecode>
        </figure>

        <figure anchor='code_write_block_guard4'>
          <sourcecode type='xdr'>
/// union write_block_guard4 switch (bool wbg_check) {
///     case TRUE:
///         guard_block_owner4   wbg_block_owner;
///     case FALSE:
///         void;
/// };
          </sourcecode>
        </figure>

        <figure anchor='code_WRITE_BLOCK4args'>
          <sourcecode type='xdr'>
/// struct WRITE_BLOCK4args {
///     /* CURRENT_FH: file */
///     stateid4           wba_stateid;
///     offset4            wba_offset;
///     stable_how4        wba_stable;
///     block_owner4       wba_owner;
///     uint32_t           wba_seq_id;
///     write_block_guard4 wba_guard;
///     write_block4       wba_data&lt;&gt;;
/// };
          </sourcecode>
        </figure>
      </section>

      <section anchor='WRITE_BLOCK4_res' numbered='true' removeInRFC='false' toc='exclude'>
        <name>RESULTS</name>

        <figure anchor='code_WRITE_BLOCK4resok'>
          <sourcecode type='xdr'>
/// struct WRITE_BLOCK4resok {
///     count4          wbr_count;
///     stable_how4     wbr_committed;
///     verifier4       wbr_writeverf;
///     block_owner4    wbr_owners&lt;&gt;;
/// };
          </sourcecode>
        </figure>

        <figure anchor='code_WRITE_BLOCK4res'>
          <sourcecode type='xdr'>
/// union WRITE_BLOCK4res switch (nfsstat4 wbr_status) {
///     case NFS4_OK:
///         WRITE_BLOCK4resok    wbr_resok4;
///     default:
///         void;
/// };
          </sourcecode>
        </figure>
      </section>

      <section anchor='WRITE_BLOCK4_desc' numbered='true' removeInRFC='false' toc='exclude'>
        <name>DESCRIPTION</name>
        <t>
          WRITE_BLOCK4 is WRITE4 (see Section 18.32 of <xref target='RFC8881'
          format='default' sectionFormat='of' />) with additional semantics
          over the block_owner and the activation of blocks. As such, all of
          the normal semantics of WRITE4 directly apply.
        </t>
        <t>
          The main difference between the two operations is that WRITE_BLOCK4
          works on blocks and not on a raw data stream. As such, wba_offset
          is the starting block offset in the file and not the byte
          offset in the file. Some erasure encoding types can have
          different block sizes depending on the
          block type. Further, wbr_count is a count of written blocks
          and not written bytes.
        </t>
        <t>
          If wba_stable is FILE_SYNC4, the data server <bcp14>MUST</bcp14>
          commit the written header and block data plus all file system
          metadata to stable storage before returning results.  This
          corresponds to the NFSv2 protocol semantics.  Any other behavior
          constitutes a protocol violation.  If wba_stable is DATA_SYNC4,
          then the data server <bcp14>MUST</bcp14> commit all of the header
          and block data to stable storage and enough of the metadata to
          retrieve the data before returning.  The data server implementer
          is free to implement DATA_SYNC4 in the same fashion as FILE_SYNC4,
          but with a possible performance drop.  If wba_stable is UNSTABLE4,
          the data server is free to commit any part of the header and block
          data and the metadata to stable storage, including all or none,
          before returning a reply to the client.  There is no guarantee
          whether or when any uncommitted data will subsequently be
          committed to stable storage.  The only guarantees made by the data
          server are that it will not destroy any data without changing the
          value of wbr_writeverf and that it will not commit the data and
          metadata at a level less than that requested by the client.
        <t>
          The activation of the header and block data is reflected in the
          bo_activated field of each of the written blocks. If the data is
          not committed to stable storage, then the bo_activated field
          <bcp14>MUST NOT</bcp14> be set to true. Once the data is committed
          to stable storage, the data server can set the block's
          bo_activated if one of these conditions applies:
        </t>
        <ul>
          <li>
            it is the first write to that block and the WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY flag is set
          </li>
          <li>
            an ACTIVATE_BLOCK4 is issued later for that block.
          </li>
        </ul>
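        <t>
          The first of these conditions, combined with the stable storage
          requirement, can be sketched in Python as follows. This is a
          non-normative illustration: the WriteBlockFlags class and the
          committed and first_write inputs are hypothetical names for state
          that a data server would track internally. Activation through a
          later ACTIVATE_BLOCK4 is not modeled.
        </t>
        <sourcecode type='python'>
from enum import IntFlag


class WriteBlockFlags(IntFlag):
    UPDATE_HEADER_ONLY = 0x00000001
    ACTIVATE_IF_EMPTY = 0x00000002


def activated_after_write(committed, first_write, wb_flags):
    """Return the bo_activated value to report right after a write."""
    if not committed:
        return False  # not on stable storage: MUST NOT be set to true
    flags = WriteBlockFlags(wb_flags)
    return first_write and WriteBlockFlags.ACTIVATE_IF_EMPTY in flags
        </sourcecode>
        <t>
          For instance, a committed first write with wb_flags of 0x00000002
          yields true, while the same write sent as UNSTABLE4 and left
          uncommitted yields false.
        </t>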
        <t>
          There are subtle interactions with write holes caused by racing
          clients. One client could win the race for each block, but because
          it used a wba_stable of UNSTABLE4, its data might not yet be
          committed to stable storage. The subsequent writes from
          a second client with a wba_stable of FILE_SYNC4 can then be the
          ones for which bo_activated is set to true for each of the blocks
          in the payload.
        </t>
        <t>
          Finally, the interaction of wba_stable can cause a client that
          receives a response with bo_activated set to false to
          mistakenly believe that the blocks are still not activated. A
          subsequent READ_BLOCK4 or READ_BLOCK_STATUS4 might show that
          bo_activated is true without any interaction by the client
          via ACTIVATE_BLOCK4.

          <cref anchor='AI26' source='TH'>
            Automatic setting of bo_activated to true if it is the first
            write should be a performance boost. But it can lead to
            the client having incorrect information (as above) and
            trying to ACTIVATE_BLOCK4 a payload that has lost the race.
            But is that bad? If you have racing clients, there is
            no guarantee at all as to the contents of the file.
          </cref>
        </t>

        <section anchor='guarded_write' numbered='true' removeInRFC='false' toc='exclude'>
          <name>Guarding the Write</name>
          <t>
            A guarded WRITE_BLOCK4 is one in which the writing of a block
            <bcp14>MUST</bcp14> fail if wba_guard.wbg_check is set
            and the target block does not have both the same change_id
            as the gbo_change_id and the same client_id as the
            gbo_client_id. This is useful in read-update-write
            scenarios. The client reads a block, updates it, and
            is prepared to write it back. It guards the write such
            that if another writer has modified the block, the
            data server will reject the modification.
          </t>
          <t>
            Note that because the guard_block_owner4 (see <xref target='code_guard_block_owner4' />)
            does not have a block_id and the WRITE_BLOCK4 applies
            to all blocks in the range of wba_offset to the length
            of wba_data, each of the target blocks
            <bcp14>MUST</bcp14> have the same change_id and
            client_id. The client <bcp14>SHOULD</bcp14> present
            the smallest set of blocks possible to meet this
            requirement.
          </t>
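          <t>
            The guard comparison can be sketched in Python as follows. This
            is a non-normative illustration under stated assumptions: the
            function name guard_allows_write is hypothetical, the guard is
            modeled as None when wbg_check is FALSE, and the sketch vets
            every target block before any data is written. Whether a data
            server compares against active or pending blocks, and how it
            orders the check against concurrent writes, is left open by
            this document.
          </t>
          <sourcecode type='python'>
def guard_allows_write(wba_guard, target_owners):
    """Return True if a WRITE_BLOCK4 may proceed past the guard.

    wba_guard is None when wbg_check is FALSE; otherwise it is a dict
    with gbo_change_id and gbo_client_id. target_owners holds the
    current owner fields of every block the write would touch.
    """
    if wba_guard is None:
        return True  # unguarded write
    return all(
        owner["bo_change_id"] == wba_guard["gbo_change_id"]
        and owner["bo_client_id"] == wba_guard["gbo_client_id"]
        for owner in target_owners
    )
          </sourcecode>
          <t>
            Because one guard covers every block in the request, a single
            block with a different change_id or client_id causes the whole
            write to be rejected, which is why presenting the smallest set
            of blocks is recommended.
          </t>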

          <t>
            <cref anchor='AI27' source='TH'>
              And the complexity goes up here. Does the DS reject only
              based on active blocks? Or can inactive ones also cause
              rejection?
            </cref>
          </t>

          <t>
            <cref anchor='AI28' source='TH'>
              Is the DS supposed to vet all blocks first or proceed to
              the first error? Or do all blocks and return an array
              of errors? (This last one is a no-go.) Also, if we do
              the vet first, what happens if a WRITE_BLOCK4 comes in
              after the vetting? Are we to lock the file during this
              process? Even if we do that, we still have the issue
              of multiple DSes.
            </cref>
          </t>
        </section>

        <section anchor='update_header' numbered='true' removeInRFC='false' toc='exclude'>
          <name>Updating the Header Only</name>
          <t>
            Some erasure encoding types keep their blocks in plain text and
            have parity blocks in order to provide integrity. A common
            configuration for Reed-Solomon is 4 active blocks, 2
            parity blocks, and 2 spares. Assuming 4kB data blocks,
            then each payload delivers 16kB of data and 8kB of parity.
            If the application modifies the first data block, then
            all that needs to change is the first active block and the
            two parity blocks in the payload.
          </t>
          <t>
            In any other approach, only 12kB of the total 24kB has to
            be written to storage. If that is attempted in the Flexible
            Files Version 2 Layout Type, then the payload will be deemed
            inconsistent. The reason for this is that the change_id
            of the unmodified blocks will not match that of the
            modified blocks.
          </t>
          <t>
            The WRITE_BLOCK_FLAGS_UPDATE_HEADER_ONLY flag in wb_flags
            can be used to avoid retransmitting the unmodified blocks.
            If it is set, then the wb_block <bcp14>MUST</bcp14> be
            empty and is ignored by the data server. Note that the client
            <bcp14>MUST</bcp14> modify only the wb_crc and the
            wba_owner.bo_change_id fields in this case. The wb_crc
            <bcp14>MUST</bcp14> change because the wba_owner.bo_change_id
            has been modified
            (see <xref target='calculating_crc' />).
          </t>
          <t>
            For the purpose of computing the activation state of the block,
            the data server <bcp14>MUST</bcp14> treat this as
            an overwrite. Thus, in the response, bo_activated
            <bcp14>MUST</bcp14> be false.
          </t>
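          <t>
            The saving from the header-only update can be worked through
            with the Reed-Solomon example above (4 data blocks and 2 parity
            blocks of 4kB each). The following non-normative Python sketch
            counts only the wb_block payload sent for one update; the
            constants and the function name block_kb_sent are illustrative
            assumptions, and header bytes are not counted.
          </t>
          <sourcecode type='python'>
BLOCK_SIZE_KB = 4   # assumption: 4kB blocks, as in the example above
DATA_BLOCKS = 4
PARITY_BLOCKS = 2


def block_kb_sent(modified_data_blocks, header_only):
    """Return the kB of wb_block data sent when updating one payload."""
    total_blocks = DATA_BLOCKS + PARITY_BLOCKS
    if not header_only:
        # every block is rewritten so that all change_ids match
        return total_blocks * BLOCK_SIZE_KB
    # parity always changes; unmodified data blocks carry no wb_block
    return (modified_data_blocks + PARITY_BLOCKS) * BLOCK_SIZE_KB
          </sourcecode>
          <t>
            Modifying one data block therefore sends 24kB of block data
            without the flag and 12kB with it, while the three unmodified
            data blocks receive only a header update.
          </t>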
        </section>
      </section>
    </section>
 </section>

  <section anchor='xdr_desc' numbered='true' removeInRFC='false' toc='default'>
    <name>Extraction of XDR</name>
    <t>
      This document contains the external data representation (XDR)
      <xref target='RFC4506' format='default' sectionFormat='of'/> description of
      the Flexible Files Version 2 Layout Type.  The XDR description is embedded in this
      document in a way that makes it simple for the reader to extract
      into a ready-to-compile form.  The reader can feed this document
      into the following shell script to produce the machine-readable
      XDR description of the Flexible Files Version 2 Layout Type:
    </t>
    <sourcecode type='bash'>
#!/bin/sh
grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'
    </sourcecode>
    <t>
      That is, if the above script is stored in a file called 'extract.sh', and
      this document is in a file called 'spec.txt', then the reader can do:
    </t>
    <sourcecode type='bash'>
sh extract.sh &lt; spec.txt &gt; erasure_coding_prot.x
    </sourcecode>
    <t>
      The effect of the script is to remove leading white space from each
      line, plus a sentinel sequence of '///'.  XDR descriptions with the
      sentinel sequence are embedded throughout the document.
    </t>
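    <t>
      For readers without a POSIX shell, the effect of the script above can
      be approximated with the following non-normative Python sketch. The
      function name extract_xdr is illustrative; the three regular
      expressions mirror the grep and sed patterns of the script.
    </t>
    <sourcecode type='python'>
import re


def extract_xdr(lines):
    """Return the sentinel-marked lines with the sentinel removed."""
    out = []
    for line in lines:
        if not re.match(r" *///", line):
            continue  # mirrors: grep '^ *///'
        line = re.sub(r"^ */// ", "", line)  # mirrors: sed 's?^ */// ??'
        line = re.sub(r"^ *///$", "", line)  # mirrors: sed 's?^ *///$??'
        out.append(line)
    return out
    </sourcecode>
    <t>
      Lines that carry only the sentinel become empty lines, and indentation
      after the sentinel and its single trailing space is preserved.
    </t>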
    <t>
      Note that the XDR code contained in this document depends on types
      from the NFSv4.2 nfs4_prot.x file (generated from
      <xref target='RFC7863' format='default' sectionFormat='of'/>)
      and the Flexible Files Layout Type flexfiles.x file (generated from
      <xref target='RFC8435' format='default' sectionFormat='of'/>).
      This includes both NFS types that end with a 4, such as offset4,
      length4, etc., as well as more generic types such as uint32_t and
      uint64_t.
    </t>
    <t>
      While the XDR can be appended to that from
      <xref target='RFC7863' format='default' sectionFormat='of'/>,
      the various code snippets belong in their respective areas of
      that XDR.
    </t>
  </section>

  <section anchor='sec_security' numbered='true' removeInRFC='false' toc='default'>
    <name>Security Considerations</name>
    <t>
      This document has the same security considerations as both Flex Files
      Layout Type version 1 (see Section 15 of <xref target='RFC8435'
      format='default' sectionFormat='of' />) and NFSv4.2 (see Section 17 of <xref
      target='RFC7862' format='default' sectionFormat='of' />).
    </t>
  </section>

  <section anchor='sec_iana' numbered='true' removeInRFC='false' toc='default'>
    <name>IANA Considerations</name>

    <section anchor='sec_iana_layouts' numbered='true' removeInRFC='false' toc='default'>
      <name>pNFS Layout Types Registry</name>
      <t>
        <xref target='RFC8881' format='default' sectionFormat='of' />
        introduced the 'pNFS Layout Types Registry'; new layout type
        numbers in this registry need to be assigned by IANA.  This document
        defines the protocol associated with an existing layout type number:
        LAYOUT4_FLEX_FILES_V2 (see <xref target='layoutlist' />).
      </t>

      <table anchor='layoutlist'>
        <name>Layout Type Assignments</name>
        <thead>
          <tr>
            <th>Layout Type Name</th>
            <th>Value</th>
            <th>RFC</th>
            <th>How</th>
            <th>Minor Versions</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>LAYOUT4_FLEX_FILES_V2</td> <td>0x6</td> <td>RFCTBD10</td> <td>L</td> <td>1</td>
          </tr>
        </tbody>
      </table>
    </section>

    <section anchor='sec_iana_recallable' numbered='true' removeInRFC='false' toc='default'>
      <name>NFSv4 Recallable Object Types Registry</name>
      <t>
        <xref target='RFC8881' format='default' sectionFormat='of' /> also
        introduced the 'NFSv4 Recallable Object Types Registry'.  This document
        defines new recallable objects for RCA4_TYPE_MASK_FFV2_LAYOUT_MIN and
        RCA4_TYPE_MASK_FFV2_LAYOUT_MAX (see <xref target='recalllist' />).
      </t>

      <table anchor='recalllist'>
        <name>Recallable Object Type Assignments</name>
        <thead>
          <tr>
            <th>Recallable Object Type Name</th>
            <th>Value</th>
            <th>RFC</th>
            <th>How</th>
            <th>Minor Versions</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>RCA4_TYPE_MASK_FFV2_LAYOUT_MIN</td> <td>20</td> <td>RFCTBD10</td> <td>L</td> <td>1</td>
          </tr>
          <tr>
            <td>RCA4_TYPE_MASK_FFV2_LAYOUT_MAX</td> <td>21</td> <td>RFCTBD10</td> <td>L</td> <td>1</td>
          </tr>
        </tbody>
      </table>
    </section>

    <section anchor='sec_iana_encoding' numbered='true' removeInRFC='false' toc='default'>
      <name>Flexible Files Version 2 Layout Type Erasure Encoding Type Registry</name>
      <t>
        This document introduces the 'Flexible Files Version 2 Layout Type Erasure Encoding Type Registry'. This
        document defines the FFV2_ENCODING_MIRRORED type for Client-Side Mirroring
        (see <xref target='erasure_encoding' />).
      </t>

      <table anchor='erasure_encoding'>
        <name>Flexible Files Version 2 Layout Type Erasure Encoding Type Assignments</name>
        <thead>
          <tr>
            <th>Erasure Encoding Type Name</th>
            <th>Value</th>
            <th>RFC</th>
            <th>How</th>
            <th>Minor Versions</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>FFV2_ENCODING_MIRRORED</td> <td>1</td> <td>RFCTBD10</td> <td>L</td> <td>2</td>
          </tr>
        </tbody>
      </table>
    </section>
  </section>

</middle>

<back>

<references>
  <name>References</name>

  <references>
  <name>Normative References</name>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.4506.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7530.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7862.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7863.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8178.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8435.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8881.xml'/>
  </references>

  <references>
  <name>Informative References</name>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.1813.xml'/>
    <reference anchor='Plank97' target='http://web.eecs.utk.edu/~jplank/plank/papers/CS-96-332.html'>
      <front>
        <title>A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems</title>
        <author fullname='James S. Plank' initials='J.' surname='Plank'>
        </author>
        <date month='September' year='1997'/>
      </front>
    </reference>
  </references>
</references>

<section numbered='true' removeInRFC='false' toc='default'>
  <name>Acknowledgments</name>
  <t>
    The following from Hammerspace were instrumental in driving
    Flex Files v2: David Flynn, Trond Myklebust, Tom Haynes, Didier Feron,
    Jean-Pierre Monchanin, Pierre Evenou, and Brian Pawlowski.
  </t>
  <t>
    Christoph Hellwig was instrumental in making sure Flexible Files
    Version 2 Layout Type
    was applicable to more than one Erasure-Encoding Type.
  </t>
</section>

<section numbered='true' removeInRFC='true' toc='default'>
  <name>RFC Editor Notes</name>

  <t>
    [RFC Editor: prior to publishing this document as an RFC, please
    replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
    RFC number of this document]
  </t>
</section>

</back>

</rfc>
