Erasure Encoding of Files in NFSv4.2

Hammerspace

loghyr@gmail.com

Transport Network File System Version 4 NFSv4

Parallel NFS (pNFS) allows a separation between the metadata (onto a metadata server) and data (onto a storage device) for a file. The Flexible File Version 2 Layout Type is defined in this document as an extension to pNFS that allows the use of storage devices that require only a limited degree of interaction with the metadata server and use already-existing protocols. Data replication is also added to provide integrity.

Discussion of this draft takes place on the NFSv4 working group mailing list (nfsv4@ietf.org), which is archived at . Working Group information can be found at .

This draft starts sparse and will be filled in as details are ironed out. For example, WRITE_BLOCK4 in is presented as being WRITE4 (see Section 18.32 of ) plus some semantic changes. In the first draft, we simply explain the semantic changes. As these are accepted by the knowledgeable reviewers, we will flesh out the WRITE_BLOCK4 section to include sub-sections more akin to 18.32.3 and 18.32.4 of .

Except where called out, all the semantics of the Flexible File Version 1 Layout Type presented in still apply. This new version extends it and does not replace it.

Introduction

In Parallel NFS (pNFS) (see Section 12 of ), the metadata server returns layout type structures that describe where file data is located. There are different layout types for different storage systems and methods of arranging data on storage devices. defined the Flexible File Version 1 Layout Type used with file-based data servers that are accessed using the NFS protocols: NFSv3 , NFSv4.0 , NFSv4.1 , and NFSv4.2 .

The Client Side Mirroring (see Section 8 of ), introduced with the first version of the Flexible File Layout Type, provides for replication of data but does not provide for integrity of data. In the event of an error, a user would be able to repair the file by silvering the mirror contents. I.e., they would pick one of the mirror instances and replicate it to the other instance locations. However, lacking integrity checks, silent corruptions cannot be detected and the choice of what constitutes the good copy is difficult.

This document updates the Flexible File Layout Type to version 2 by providing data integrity for erasure encoding. It introduces new variants of COMMIT4 (see Section 18.3 of ) , READ4 (see Section 18.22 of ) , and WRITE4 (see Section 18.32 of ) to allow for the transmission of integrity checking.

Using the process detailed in , the revisions in this document become an extension of NFSv4.2 . They are built on top of the external data representation (XDR) generated from .

Definitions

block:: One of the resulting blocks to be exchanged with a data server after a transformation has been applied to a data block. Note that the resulting block may be a different size than the data block.
Client Side Mirroring:: A file based replication method where copies are maintained in parallel.
data block:: A block of data in the client's cache for a file.
Erasure Encoding:: A data protection scheme where a block of data is broken into fragments and additional redundant fragments are added to achieve parity. The new blocks are stored in different locations.
Client Side Erasure Encoding:: A file based integrity method where copies are maintained in parallel.
consistency of payload:: A payload is consistent when all contained blocks have the same owner, i.e., they share the same writing client and transaction id.
integrity of data:: Data integrity refers to the accuracy, consistency, and reliability of data throughout its life cycle.
payload:: The set of metadata headers and transformed blocks generated per data block by the erasure encoding type. Note that the resulting blocks might be of type active, parity, spare, or repair.
replication of data:: Data replication is making and storing multiple copies of data in different locations.
write hole:: A write hole is a data corruption scenario where either two clients are trying to write to the same block or one client is overwriting an existing block of data.

Requirements Language

The key words 'MUST', 'MUST NOT', 'REQUIRED', 'SHALL', 'SHALL NOT', 'SHOULD', 'SHOULD NOT', 'RECOMMENDED', 'NOT RECOMMENDED', 'MAY', and 'OPTIONAL' in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.

Flexible File Version 2 Layout Type

In order to introduce erasure encoding to pNFS, a new layout type of LAYOUT4_FLEX_FILES_V2 needs to be defined. While we could define a new layout type per erasure encoding type, there exist use cases where multiple erasure encoding types exist in the same layout. The original layouttype4 introduced in is modified to be as in .

   enum layouttype4 {
       LAYOUT4_NFSV4_1_FILES   = 1,
       LAYOUT4_OSD2_OBJECTS    = 2,
       LAYOUT4_BLOCK_VOLUME    = 3,
       LAYOUT4_FLEX_FILES      = 4,
       LAYOUT4_FLEX_FILES_V2   = 5
   };

   struct layout_content4 {
       layouttype4             loc_type;
       opaque                  loc_body<>;
   };

   struct layout4 {
       offset4                 lo_offset;
       length4                 lo_length;
       layoutiomode4           lo_iomode;
       layout_content4         lo_content;
   };

This document defines structures associated with the layouttype4 value LAYOUT4_FLEX_FILES_V2. specifies the loc_body structure as an XDR type 'opaque'. The opaque layout is uninterpreted by the generic pNFS client layers but is interpreted by the Flexible File Version 2 Layout Type implementation. This section defines the structure of this otherwise opaque value, ffv2_layout4.

ffv2_encoding_type

   /// enum ffv2_encoding_type {
   ///     FFV2_ENCODING_MIRRORED = 0x1
   /// };

The ffv2_encoding_type (see ) encompasses a new IANA registry for 'Flex Files V2 Erasure Encoding Type Registry' (see ). I.e., instead of defining a new Layout Type for each Erasure Encoding, we define a new Erasure Encoding Type. Except for FFV2_ENCODING_MIRRORED, each of the types is expected to employ the new operations in this document.

FFV2_ENCODING_MIRRORED offers replication of data and not integrity of data. As such, it does not need operations like WRITE_BLOCK4 (see ).

ff_flags4

   const FF_FLAGS_NO_LAYOUTCOMMIT4   = 0x00000001;
   const FF_FLAGS_NO_IO_THRU_MDS     = 0x00000002;
   const FF_FLAGS_NO_READ_IO         = 0x00000004;
   const FF_FLAGS_WRITE_ONE_MIRROR   = 0x00000008;

   typedef uint32_t            ff_flags4;

ff_flags4 is defined as in Section 5.1 of and is shown in for reference.

ffv2_file_info4

   /// struct ffv2_file_info4 {
   ///     stateid4                fffi_stateid;
   ///     nfs_fh4                 fffi_fh_vers;
   /// };

The ffv2_file_info4 is a new structure to help with the stateid issue discussed in Section 5.1 of . I.e., in version 1 of the Flexible File Layout Type, there was a single ffds_stateid combined with the ffds_fh_vers array, even though each NFSv4 version has its own stateid. In , each NFSv4 file handle has a one-to-one correspondence to a stateid.

ffv2_ds_flags4

   /// const FFV2_DS_FLAGS_ACTIVE    = 0x00000001;
   /// const FFV2_DS_FLAGS_SPARE     = 0x00000002;
   /// const FFV2_DS_FLAGS_PARITY    = 0x00000004;
   /// const FFV2_DS_FLAGS_REPAIR    = 0x00000008;
   /// typedef uint32_t            ffv2_ds_flags4;

The ffv2_ds_flags4 (in ) flags detail the state of the data servers. With Erasure Encoding algorithms, there are both Systematic and Non-Systematic approaches. In the Systematic approach, the bits for integrity are placed amongst the resulting transformed block. Such an implementation would typically see FFV2_DS_FLAGS_ACTIVE and FFV2_DS_FLAGS_SPARE data servers. The FFV2_DS_FLAGS_SPARE ones allow the client to repair a payload without engaging the metadata server. I.e., if one of the FFV2_DS_FLAGS_ACTIVE data servers did not respond to a WRITE_BLOCK4, the client could fail the block over to the FFV2_DS_FLAGS_SPARE data server.

With the Non-Systematic approach, the data and integrity live on different data servers. Such an implementation would typically see FFV2_DS_FLAGS_ACTIVE and FFV2_DS_FLAGS_PARITY data servers. If the implementation wanted to allow for local repair, it would also use FFV2_DS_FLAGS_SPARE. Note that with a Non-Systematic approach, it is possible to update parts of the blocks, see .

See for further reference to storage layouts for encoding.
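The failover to a spare data server described above can be sketched as follows. This is only an illustration in Python and not part of the protocol: the helper name pick_spare and the representation of a mirror as a plain list of ffds_flags values are assumptions made for the example.

```python
# Flag values copied from the ffv2_ds_flags4 definition in this section.
FFV2_DS_FLAGS_ACTIVE = 0x00000001
FFV2_DS_FLAGS_SPARE = 0x00000002
FFV2_DS_FLAGS_PARITY = 0x00000004
FFV2_DS_FLAGS_REPAIR = 0x00000008


def pick_spare(ds_flags, failed_index):
    """Return the index of the first spare data server, or None.

    ds_flags is a hypothetical list holding the ffds_flags value of each
    ffv2_data_server4 in one mirror; failed_index is the data server that
    did not respond to WRITE_BLOCK4.
    """
    for index, flags in enumerate(ds_flags):
        if index != failed_index and flags & FFV2_DS_FLAGS_SPARE:
            return index
    return None
```

If no data server is marked FFV2_DS_FLAGS_SPARE, the helper returns None and the client has to fall back to another recovery path.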

ffv2_data_server4

   /// struct ffv2_data_server4 {
   ///     deviceid4               ffds_deviceid;
   ///     uint32_t                ffds_efficiency;
   ///     ffv2_file_info4         ffds_file_info<>;
   ///     fattr4_owner            ffds_user;
   ///     fattr4_owner_group      ffds_group;
   ///     ffv2_ds_flags4          ffds_flags;
   /// };

The ffv2_data_server4 (in ) describes a data file and how to access it via the different NFS protocols.

ffv2_encoding_type_data

   /// union ffv2_encoding_type_data switch
   ///         (ffv2_encoding_type fetd_encoding) {
   ///     case FFV2_ENCODING_MIRRORED:
   ///         void;
   /// };

The ffv2_encoding_type_data (in ) describes erasure encoding type specific fields. I.e., this is how the encoding type can communicate the need for counts of active, spare, parity, and repair types of blocks.

ffv2_mirror4

   /// struct ffv2_mirror4 {
   ///     ffv2_data_server4       ffm_data_servers<>;
   ///     ffv2_encoding_type_data ffm_encoding_type_data;
   /// };

The ffv2_mirror4 (in ) describes the Flexible File Layout Version 2 specific fields.

ffv2_layout4

   /// struct ffv2_layout4 {
   ///     length4                 ffl_stripe_unit;
   ///     ffv2_mirror4            ffl_mirrors<>;
   ///     ff_flags4               ffl_flags;
   ///     uint32_t                ffl_stats_collect_hint;
   /// };

The ffv2_layout4 (in ) describes the Flexible Files Layout Version 2.

ffv2_layouthint4

   /// union ffv2_mirrors_hint switch (ffv2_encoding_type ffmh_type) {
   ///     case FFV2_ENCODING_MIRRORED:
   ///         void;
   /// };
   ///
   /// struct ffv2_layouthint4 {
   ///     ffv2_encoding_type      fflh_supported_types<>;
   ///     ffv2_mirrors_hint       fflh_mirrors_hint;
   /// };

The ffv2_layouthint4 (in ) describes the layout_hint (see Section 5.12.4 of ) that the client can provide to the metadata server.

Mixing of Encoding Types

Note that effectively, multiple encoding types can be present in a Flexible Files Version 2 Layout Type layout. The ffv2_layout4 has an array of ffv2_mirror4, each of which has a ffv2_encoding_type. The main reason to allow for this is to provide for either the assimilation of a non-erasure encoded file to an erasure encoded file or the exporting of an erasure encoded file to a non-erasure encoded file.

Assume there is an additional ffv2_encoding_type of FFV2_ENCODING_REED_SOLOMON and it needs 4 active blocks, 2 parity blocks, and 2 spare blocks. The user wants to actively assimilate a regular file. As such, a layout might be as represented in . As this is an assimilation, most of the data reads will be satisfied by READ4 (see Section 18.22 of ) calls to index 0. However, as this is also an active file, there could also be READ_BLOCK4 (see ) calls to the other indexes.

   +---------------------------------------------------+
   | ffv2_layout4:                                     |
   +---------------------------------------------------+
   |     ffl_mirrors[0]:                               |
   |         ffm_data_servers:                         |
   |             ffv2_data_server4[0]                  |
   |                 ffds_flags: 0                     |
   |         ffm_encoding: FFV2_ENCODING_MIRRORED      |
   +---------------------------------------------------+
   |     ffl_mirrors[1]:                               |
   |         ffm_data_servers:                         |
   |             ffv2_data_server4[0]                  |
   |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
   |             ffv2_data_server4[1]                  |
   |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
   |             ffv2_data_server4[2]                  |
   |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
   |             ffv2_data_server4[3]                  |
   |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
   |             ffv2_data_server4[4]                  |
   |                 ffds_flags: FFV2_DS_FLAGS_PARITY  |
   |             ffv2_data_server4[5]                  |
   |                 ffds_flags: FFV2_DS_FLAGS_PARITY  |
   |             ffv2_data_server4[6]                  |
   |                 ffds_flags: FFV2_DS_FLAGS_SPARE   |
   |             ffv2_data_server4[7]                  |
   |                 ffds_flags: FFV2_DS_FLAGS_SPARE   |
   |         ffm_encoding: FFV2_ENCODING_REED_SOLOMON  |
   +---------------------------------------------------+

When performing I/O via a FFV2_ENCODING_MIRRORED encoding type, the non-transformed data will be used, whereas with other encoding types, a metadata header and transformed block will be sent. Further, when reading data from the instance files, the client MUST be prepared to have one of the encoding types supply data and the other type not supply data. I.e., the READ_BLOCK4 call might return rbr_eof set to true (see ), which indicates that there is no data, whereas the READ4 call might return eof set to false, which indicates that there is data. The client MUST determine that there is in fact data.

An example use case is the active assimilation of a file to ensure integrity. As the client is helping to translate the file to the new encoding scheme, it is actively modifying the file. As such, it might be sequentially reading the file in order to translate it. The READ4 call would be returning data and the READ_BLOCK4 call would not be returning data. As the client overwrites the file, the WRITE4 call and the WRITE_BLOCK4 call would both have data sent. Finally, if the client reads back a section which had been modified earlier, both the READ4 and READ_BLOCK4 calls would return data.

Erasure Encoding

Erasure Encoding takes a data block and transforms it to a payload to send to the data servers (see ). It generates a metadata header and transformed block per data server. The header is metadata information for the transformed block. From now on, the metadata is simply referred to as the header and the transformed block as the block. The payload of a data block is the set of generated headers and blocks for that data block.

The change_id is a unique identifier generated by the client to describe the current write transaction. The client_id is a unique identifier assigned by the metadata server to describe which client is making the current write transaction. The seq_id describes the index across the payload. The eff_len is the length of the data within the block. Finally, the crc32 is the 32 bit crc calculation of the header (with the crc32 field being 0) and the block. By combining the two parts of the payload, integrity is ensured for both parts.

While the data block might have a length of 4kB, that does not necessarily mean that the length of the block is 4kB. That length is determined by the erasure encoding type algorithm. For example, Reed Solomon might have 4kB blocks with the data integrity being provided by parity blocks. Another example would be the Mojette Transformation, which might have 1kB block lengths.

The payload contains redundancy which will allow the erasure encoding type algorithm to repair blocks in the payload as it is transformed back to a data block (see ). A payload is consistent when all of the contained headers share the same change_id and client_id. It has integrity when it is consistent and the blocks all pass the crc32 checks.
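The consistency and integrity rules above can be sketched in a few lines of Python. This document does not define the byte layout of the header or the CRC polynomial, so the big-endian packing order used here and the use of zlib.crc32 are assumptions made only for illustration; the Header class and the function names are likewise hypothetical.

```python
import struct
import zlib
from dataclasses import dataclass


@dataclass
class Header:
    change_id: int
    client_id: int
    seq_id: int
    eff_len: int
    crc32: int = 0


def compute_crc32(hdr, block):
    # ASSUMPTION: the header is packed big-endian as two 64-bit and
    # three 32-bit fields, with the crc32 field set to 0 as the text
    # requires, followed by the block bytes.
    packed = struct.pack(">QQIII", hdr.change_id, hdr.client_id,
                         hdr.seq_id, hdr.eff_len, 0)
    return zlib.crc32(packed + block) & 0xFFFFFFFF


def is_consistent(headers):
    # Consistent: every header shares the same change_id and client_id.
    owners = {(h.change_id, h.client_id) for h in headers}
    return len(owners) == 1


def has_integrity(payload):
    # payload is a list of (Header, block bytes) pairs.
    headers = [h for h, _ in payload]
    return is_consistent(headers) and all(
        compute_crc32(h, b) == h.crc32 for h, b in payload)
```

A payload that mixes headers from two writers fails is_consistent before any CRC is examined, which mirrors the ordering of the two definitions.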

Encoding a Data Block

                  +-----------------+
                  |   data block    |
                  +-----------------+
                  |                 |
                  |    3kB data     |
                  |                 |
                  +-----------------+
                  |    1kB empty    |
                  +-------+---------+
                          |
                          |
   +----------------------+-----------------------+
   |     Erasure Encoding (Transform Forward)     |
   +----+-------------------------------------+---+
        |                                     |
        |                                     |
    +---+----------------+         +----------+---------+
    | HEADER             |         | HEADER             |
    +--------------------+         +--------------------+
    | change_id: 3       |         | change_id: 3       |
    | client_id: 6       |         | client_id: 6       |
    | seq_id   : 0       |         | seq_id   : 5       |
    | eff_len  : 3kB     |   ...   | eff_len  : 3kB     |
    | crc32    :         |         | crc32    :         |
    +--------------------+         +--------------------+
    | BLOCK              |         | BLOCK              |
    +--------------------+         +--------------------+
    | data: ....         |         | data: ....         |
    +--------------------+         +--------------------+
        Data Server 1                  Data Server 6

Each data block of the file resident in the client's cache of the file will be encoded into N different payloads to be sent to the data servers as shown in . As WRITE_BLOCK4 (see ) can encode multiple write_block4 into a single transaction, a more accurate description of a WRITE_BLOCK4 might be as in .

Calculating the CRC32

Decoding a Data Block

Checking the CRC32

   +------------------------------------+
   | READ_BLOCK4resok                   |
   +------------------------------------+
   | rbr_eof: false                     |
   | rbr_blocks[0]:                     |
   |     rb_crc: 0x21de8                |
   |     rb_effective_len: 3kB          |
   |     rb_owner:                      |
   |         bo_block_id: 1             |
   |         bo_change_id: 7            |
   |         bo_client_id: 6            |
   |         bo_activated: true         |
   |     rb_block: ......               |
   +------------------------------------+

Assuming the READ_BLOCK4 results as in , the crc32 needs to be checked in order to ensure data integrity. Conceptually, a header and payload can be built as shown in . The crc32 is calculated over the 5 fields as shown in the header and the 3kB of data block. In this example, it is calculated to be 0x21de8. Thus this payload for the data server has data integrity.

Blocks and Activating

Unlike the regular NFSv4.2 I/O operations, the base unit of I/O in this document is the block. The raw data stream is encoded/decoded into blocks as described in . Each block has the concept of whether it is activated or pending activation. This is crucial in detecting write holes. A write hole occurs either when two different clients write to the same block concurrently or when a client overwrites existing data. In the first scenario, the order of writes is not deterministic and can result in a mixture of blocks in the payload. In the second scenario, network partitions or client restarts can result in partial writes. In both cases, the blocks have to be repaired, either by abandoning the new I/O or by sorting out the winner.

Note that unlike the case of the encoding type detecting data integrity issues (see ), the case of write holes is in the scope of this document. What is out of scope of this document is the manner in which the data servers implement the semantics of the new operations. I.e., the data servers might be able to leverage the native file system to achieve the semantics or they might completely implement a multi-file approach to stage WRITE_BLOCK4 results and then shuffle blocks when the ACTIVATE_BLOCK4 or ROLLBACK_BLOCK4 operations activate or discard the data.

Dead or Partitioned Client

Consider a client which crashes while it is in the middle of sending WRITE_BLOCK4 operations to a set of data servers. Regardless of whether it comes back online or not, the metadata server can detect that the client had restarted while it held an outstanding LAYOUTIOMODE4_RW layout on the file. The metadata server can assign the file to a repair program, which would basically scan the entire file with READ_BLOCK_STATUS4. When it determines that it does not have enough payload blocks to rebuild the data block, it can determine that the I/O for that data block was not complete and throw away the blocks.

Note that the repair process can throw away the blocks by using the ROLLBACK_BLOCK4 operation to unstage the pending written blocks.
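As a rough sketch of the repair scan, the following Python function decides which pending transactions to unstage. It assumes the repair program has already collected the pending block_owner4 entries from each data server via READ_BLOCK_STATUS4 and reduced them to (bo_block_id, bo_change_id, bo_client_id) tuples. The function name and the min_blocks parameter (the number of payload blocks the erasure encoding type needs to rebuild a data block) are hypothetical.

```python
from collections import Counter


def blocks_to_roll_back(pending_by_ds, min_blocks):
    """Return the pending owners that cannot be rebuilt.

    pending_by_ds holds one iterable per data server, each containing the
    (bo_block_id, bo_change_id, bo_client_id) tuples that are pending on
    that data server.
    """
    counts = Counter()
    for pending in pending_by_ds:
        # Count each owner at most once per data server.
        counts.update(set(pending))
    return sorted(owner for owner, seen in counts.items()
                  if seen < min_blocks)
```

Each returned tuple would then be named in a ROLLBACK_BLOCK4 sent to the data servers that still hold the pending block.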

Client Overwrite

Consider a client which gets back conflicting information in the WRITE_BLOCK4 results. Assume that we had written to 6 data servers with WRITE_BLOCK4s as in . And we get the results as in .

   Data Server 1
   +--------------------------------+
   | WRITE_BLOCK4resok              |
   +--------------------------------+
   | wbr_count: 2                   |
   | wbr_committed: FILE_SYNC4      |
   | wbr_writeverf: 0xf1234abc      |
   | wbr_owners[0]:                 |
   |     bo_block_id: 1             |
   |     bo_change_id: 2            |
   |     bo_client_id: 6            |
   |     bo_activated: true         |
   | wbr_owners[1]:                 |
   |     bo_block_id: 1             |
   |     bo_change_id: 3            |
   |     bo_client_id: 6            |
   |     bo_activated: false        |
   | wbr_owners[2]:                 |
   |     bo_block_id: 2             |
   |     bo_change_id: 3            |
   |     bo_client_id: 6            |
   |     bo_activated: true         |
   +--------------------------------+

But assume that data server 4 does not respond to the WRITE_BLOCK4 operation. While the client can detect this and send the WRITE_BLOCK4 to any data server marked as FFV2_DS_FLAGS_SPARE, it might decide to see if the data server did in fact do the transaction. It might also be the case that there are no data servers marked as FFV2_DS_FLAGS_SPARE. The client issues a READ_BLOCK_STATUS4 (see ) and gets the results in . This indicates that data server 4 did not get the WRITE_BLOCK4 request.

In general, the client can either resend the WRITE_BLOCK4 request, determine by the erasure encoding type that there are sufficient payload blocks present to decode the data block, or ROLLBACK_BLOCK4 the existing blocks to back out the change.

Racing Clients

Assume that the client has written to 6 data servers with WRITE_BLOCK4s as in . But now it gets back the conflicting results in and . From this, it can detect that there was a race with another client. Note, even though both clients present the same bo_change_id, nothing can be inferred as to the ordering of the two transactions. In some cases, bo_client_id 10 won the race and in some cases, bo_client_id 6 won the race.

As a subsequent READ_BLOCK4 will produce garbage, the clients need to agree on how to fix this issue without any communication. A simplistic approach is for each client to retry the WRITE_BLOCK4 until such time as the payload is consistent. Note, this does not mean that both clients win, it just means that one of them wins.

Another option is for the clients to report a LAYOUTERROR4 (see Section 15.6 of ) to the metadata server with an error of NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT. That would then allow the metadata server to assign the repairing of the file.
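The two options can be combined into a bounded retry, sketched below in Python. The callable write_payload is hypothetical: it stands for whatever client code re-sends the WRITE_BLOCK4s and returns, per data server, the (bo_change_id, bo_client_id) pair of the activated block. The retry limit is likewise an assumption; this document does not specify one.

```python
NFS4_OK = 0
NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT = 10097


def retry_until_consistent(write_payload, max_retries=3):
    """Retry the write until one owner holds every block of the payload.

    Returns NFS4_OK once the payload is consistent; otherwise returns the
    error the client would report to the metadata server via LAYOUTERROR4.
    """
    for _ in range(max_retries):
        owners = write_payload()
        if len(set(owners)) == 1:
            # Consistent, although the winner may be the other client.
            return NFS4_OK
    return NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT
```

Success here only means the payload has a single owner; as noted above, that owner is not necessarily the client doing the retrying.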

Multiple Writers

Note that nothing prevents pending blocks from accumulating or more than 2 writers from trying to write the same payload. An example of such a WRITE_BLOCK4resok in response to the example of is shown in . Not only has client 6 tried to update block 1, but clients 6, 7, and 20 are all attempting to update it.

Reader and Writer Racing

In addition to the above write hole scenarios, a further complication is a racing reader and writer. If the client reads a block and determines that the payload is not consistent (i.e., not all of the payload blocks share the same client_id and change_id), then it can assume that it has encountered a race with another client writing to the file. It SHOULD retry the READ_BLOCK4 operation until payload consistency is achieved. It may decide to send a LAYOUTERROR4 to the metadata server with an error of NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT.

And should it hang forever? Perhaps a new layout error that the client can send the MDS? Or should it probe with READ_BLOCK_STATUS4 to try to repair? Perhaps a LAYOUTERROR_BLOCK4 to send an encoding type specific location?

New Infrastructure

Errors

Error 10097 - NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT

The client encountered a payload in which the blocks were inconsistent and remained inconsistent. As the client can not tell if another client is actively writing, it informs the metadata server of this error via LAYOUTERROR4. The metadata server can then arrange for repair of the file. Note that due to the opaqueness of the clientid4, the client can not differentiate between boot instances of the metadata server or client, but the metadata server can do that differentiation. I.e., it can tell if the inconsistency is from the same client and whether that client is active and actively writing to the file (i.e., does the client have the file open and with a LAYOUTIOMODE4_RW layout?).

Error 10098 - NFS4ERR_ERASURE_ENCODING_NOT_SUPPORTED

The client requested a ffv2_encoding_type which the metadata server does not support. I.e., if the client sends a layout_hint requesting an erasure encoding type that the metadata server does not support, this error code can be returned. The client might have to send the layout_hint several times to determine the overlapping set of supported erasure encoding types.
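One way a client might probe for a mutually supported encoding type is sketched below in Python. The callable send_layout_hint is hypothetical and stands for the client code that sends a layout_hint naming one encoding type and returns the resulting status; how the hint is actually delivered is outside this sketch.

```python
NFS4_OK = 0
NFS4ERR_ERASURE_ENCODING_NOT_SUPPORTED = 10098


def negotiate_encoding(preferred_types, send_layout_hint):
    """Return the first preferred encoding type the server accepts.

    preferred_types is ordered from most to least preferred. Returns
    None if the two sets of supported types do not overlap.
    """
    for encoding_type in preferred_types:
        status = send_layout_hint(encoding_type)
        if status == NFS4_OK:
            return encoding_type
        if status != NFS4ERR_ERASURE_ENCODING_NOT_SUPPORTED:
            # Any other error is unrelated to encoding negotiation.
            raise RuntimeError("layout_hint failed with status %d" % status)
    return None
```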

Error 10099 - NFS4ERR_ERASURE_ENCODING_BLOCK_MISMATCH

The client asked the data server to update only the header, and the data server can not find a matching block at that offset.

EXCHGID4_FLAG_USE_ERASURE_DS

   /// const EXCHGID4_FLAG_USE_ERASURE_DS = 0x00100000;

When a data server connects to a metadata server it can via EXCHANGE_ID (see Section 18.35 of ) state its pNFS role. The data server can use EXCHGID4_FLAG_USE_ERASURE_DS (see ) to indicate that it supports the new NFSv4.2 operations introduced in this document. Section 13.1 describes the interaction of the various pNFS roles masked by EXCHGID4_FLAG_MASK_PNFS. However, that does not mask out EXCHGID4_FLAG_USE_ERASURE_DS. I.e., EXCHGID4_FLAG_USE_ERASURE_DS can be used in combination with all of the pNFS flags.

If the data server sets EXCHGID4_FLAG_USE_ERASURE_DS during the EXCHANGE_ID operation, then it MUST support: ACTIVATE_BLOCK4, READ_BLOCK_STATUS4, READ_BLOCK4, ROLLBACK_BLOCK4, and WRITE_BLOCK4. Further, note that this support is orthogonal to the Erasure Encoding Type selected. The data server is unaware of which type is driving the I/O. It is also unaware of the payload layout or what type of block it is serving.
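The relationship between the new flag and the pNFS role mask can be illustrated with a short Python sketch. The pNFS role constants are the NFSv4.1 EXCHANGE_ID flag values; the two helper functions are hypothetical names introduced for this example.

```python
# pNFS role flags and mask as defined for NFSv4.1 EXCHANGE_ID.
EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000
EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000
EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000
EXCHGID4_FLAG_MASK_PNFS = 0x00070000

# New flag from this document.
EXCHGID4_FLAG_USE_ERASURE_DS = 0x00100000


def pnfs_roles(flags):
    # Only the role bits survive the mask; the erasure flag does not.
    return flags & EXCHGID4_FLAG_MASK_PNFS


def supports_erasure_ops(flags):
    # True when the data server advertises the new block operations.
    return bool(flags & EXCHGID4_FLAG_USE_ERASURE_DS)
```

Because the new bit lies outside EXCHGID4_FLAG_MASK_PNFS, it has to be tested separately from the role bits.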

Block Owner

   /// struct block_owner4 {
   ///     uint32_t                bo_block_id;
   ///     changeid4               bo_change_id;
   ///     clientid4               bo_client_id;
   ///     bool                    bo_activated;
   /// };

The block_owner4 (see ) is used to determine when and by whom a block was written. The bo_block_id is used to identify the block and MUST be the index of the block within the file. I.e., it is the offset of the start of the block divided by the block length. The bo_client_id MUST be the client id handed out by the metadata server to the client as the eir_clientid during the EXCHANGE_ID results (see Section 18.35 of ) and MUST NOT be the client id supplied by the data server to the client. I.e., across all data files, the bo_client_id uniquely describes one and only one client.

The bo_change_id is like the change attribute (see Section 5.8.1.4 of ) in that each block write by a given client has to have a unique bo_change_id. I.e., it can be determined to which transaction across all data files a block corresponds.

The bo_activated is used by the data server to indicate whether the block I/O was activated or pending activation. The first WRITE_BLOCK4 to a location is automatically activated if the WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY is set. Subsequent WRITE_BLOCK4 modifications to that block location are not automatically activated. The client has to ACTIVATE_BLOCK4 the block in order to get it activated. The concept of automatically activating is dependent on the wba_stable field of the WRITE_BLOCK4args.
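The derivation of bo_block_id from a byte offset is simple integer division, as in the Python sketch below. The function name is hypothetical, and rejecting an offset that does not fall on a block boundary is an assumption of this example, not a requirement stated in this document.

```python
def block_id_for_offset(byte_offset, block_len):
    """Return the bo_block_id for a block starting at byte_offset."""
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    if byte_offset % block_len != 0:
        # ASSUMPTION: blocks always start on a block_len boundary.
        raise ValueError("offset is not the start of a block")
    return byte_offset // block_len
```

For example, with 4kB blocks the block that starts at byte offset 8192 has a bo_block_id of 2.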

New NFSv4.2 Operations

Operation 77: ACTIVATE_BLOCK4 - Activate Cached Block Data

ARGUMENTS

   /// struct ACTIVATE_BLOCK4args {
   ///     /* CURRENT_FH: file */
   ///     offset4                 aba_offset;
   ///     count4                  aba_count;
   ///     block_owner4            aba_blocks<>;
   /// };

RESULTS

   /// struct ACTIVATE_BLOCK4resok {
   ///     verifier4               abr_writeverf;
   /// };

   /// union ACTIVATE_BLOCK4res switch (nfsstat4 abr_status) {
   ///     case NFS4_OK:
   ///         ACTIVATE_BLOCK4resok    abr_resok4;
   ///     default:
   ///         void;
   /// };

DESCRIPTION

ACTIVATE_BLOCK4 is COMMIT4 (see Section 18.3 of ) with additional semantics over the block_owner and the activation of blocks. As such, all of the normal semantics of COMMIT4 directly apply.

The main difference between the two operations is that ACTIVATE_BLOCK4 works on blocks and not a raw data stream. As such aba_offset is the starting block offset in the file and not the byte offset in the file. Some erasure encoding types can have different block sizes depending on the block type. Further, aba_count is a count of blocks to activate and not bytes to activate.

Further, while it may appear that the combination of aba_offset and aba_count is redundant to aba_blocks, the purpose of aba_blocks is to allow the data server to differentiate between potentially multiple pending blocks.

Operation 78: READ_BLOCK_STATUS4 - Read Block Commit Status from File

ARGUMENTS

   /// struct READ_BLOCK_STATUS4args {
   ///     /* CURRENT_FH: file */
   ///     stateid4                rbsa_stateid;
   ///     offset4                 rbsa_offset;
   ///     count4                  rbsa_count;
   /// };

RESULTS

   /// struct READ_BLOCK_STATUS4resok {
   ///     bool                    rbsr_eof;
   ///     block_owner4            rbsr_blocks<>;
   /// };

   /// union READ_BLOCK_STATUS4res switch (nfsstat4 rbsr_status) {
   ///     case NFS4_OK:
   ///         READ_BLOCK_STATUS4resok rbsr_resok4;
   ///     default:
   ///         void;
   /// };

DESCRIPTION

READ_BLOCK_STATUS4 differs from READ_BLOCK4 in that it only reads active and pending headers in the desired data range.

Operation 79: READ_BLOCK4 - Read Blocks from File

ARGUMENTS

   /// struct READ_BLOCK4args {
   ///     /* CURRENT_FH: file */
   ///     stateid4                rba_stateid;
   ///     offset4                 rba_offset;
   ///     count4                  rba_count;
   /// };

RESULTS

   /// struct read_block4 {
   ///     uint32_t                rb_crc;
   ///     uint32_t                rb_effective_len;
   ///     block_owner4            rb_owner;
   ///     uint32_t                rb_seq_id;
   ///     opaque                  rb_block<>;
   /// };

   /// struct READ_BLOCK4resok {
   ///     bool                    rbr_eof;
   ///     read_block4             rbr_blocks<>;
   /// };

   /// union READ_BLOCK4res switch (nfsstat4 rbr_status) {
   ///     case NFS4_OK:
   ///         READ_BLOCK4resok    rbr_resok4;
   ///     default:
   ///         void;
   /// };

DESCRIPTION

READ_BLOCK4 is READ4 (see Section 18.22 of ) with additional semantics over the block_owner and the activation of blocks. As such, all of the normal semantics of READ4 directly apply.

The main difference between the two operations is that READ_BLOCK4 works on blocks and not a raw data stream. As such rba_offset is the starting block offset in the file and not the byte offset in the file. Some erasure encoding types can have different block sizes depending on the block type. Further, rba_count is a count of blocks to read and not bytes to read.

READ_BLOCK4 also only returns the activated block at the location. I.e., if a client overwrites a block at offset 10 and then tries to read the block without activating it, the original block is returned.

When reading a set of blocks across the data servers, it can be the case that some data servers do not have any data at that location. In that case, the server either returns rbr_eof if the rba_offset exceeds the number of blocks that the data server is aware of or it returns an empty block for that block.

For example, in , the client asks for 4 blocks starting with the 3rd block in the file. The second data server responds as in . The client would read this as there is valid data for blocks 2 and 4, there is a hole at block 3, and there is no data for block 5. Note that the data server MUST calculate a valid rb_crc for block 3 based on the generated fields.
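The interpretation in the example can be sketched in Python as below. This is an illustration under stated assumptions: an empty block is taken to be one whose rb_effective_len is 0, the returned blocks are assumed to be contiguous starting at rba_offset, and the function name and result strings are hypothetical.

```python
def classify_blocks(rba_offset, rba_count, effective_lens, rbr_eof):
    """Map each requested block index to 'data', 'hole', or 'no data'.

    effective_lens lists the rb_effective_len of each returned
    read_block4, in order, starting at block index rba_offset.
    """
    status = {}
    for i in range(rba_count):
        block_id = rba_offset + i
        if i < len(effective_lens):
            # ASSUMPTION: a zero effective length marks an empty block.
            status[block_id] = "hole" if effective_lens[i] == 0 else "data"
        else:
            # Nothing returned: past the end if rbr_eof, else a short read.
            status[block_id] = "no data" if rbr_eof else "short read"
    return status
```

With rba_offset of 2, rba_count of 4, effective lengths of 3kB, 0, and 3kB, and rbr_eof set to true, this yields data for blocks 2 and 4, a hole at block 3, and no data for block 5.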

Operation 80: ROLLBACK_BLOCK4 - Rollback Cached Block Data

ARGUMENTS

   /// struct ROLLBACK_BLOCK4args {
   ///     /* CURRENT_FH: file */
   ///     offset4                 rba_offset;
   ///     count4                  rba_count;
   ///     block_owner4            rba_blocks<>;
   /// };

RESULTS

   /// struct ROLLBACK_BLOCK4resok {
   ///     verifier4               rbr_writeverf;
   /// };

   /// union ROLLBACK_BLOCK4res switch (nfsstat4 rbr_status) {
   ///     case NFS4_OK:
   ///         ROLLBACK_BLOCK4resok    rbr_resok4;
   ///     default:
   ///         void;
   /// };

DESCRIPTION

ROLLBACK_BLOCK4 is a new operation modeled on COMMIT4 (see Section 18.3 of ) with additional semantics over the block_owner and the rolling back of written blocks. As such, all of the normal semantics of COMMIT4 directly apply.

The main difference between the two operations is that ROLLBACK_BLOCK4 works on blocks and not a raw data stream. As such rba_offset is the starting block offset in the file and not the byte offset in the file. Some erasure encoding types can have different block sizes depending on the block type. Further, rba_count is a count of blocks to rollback and not bytes to rollback.

Further, while it may appear that the combination of rba_offset and rba_count is redundant to rba_blocks, the purpose of rba_blocks is to allow the data server to differentiate between potentially multiple pending blocks.

ROLLBACK_BLOCK4 deletes prior WRITE_BLOCK4 transactions. In case of write holes, it allows the client to undo transactions to repair the file.

Operation 81: WRITE_BLOCK4 - Write Blocks to File

ARGUMENTS

   /// const WRITE_BLOCK_FLAGS_UPDATE_HEADER_ONLY = 0x00000001;
   /// const WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY  = 0x00000002;

   /// struct write_block4 {
   ///     uint32_t                wb_crc;
   ///     uint32_t                wb_effective_len;
   ///     uint32_t                wb_flags;
   ///     opaque                  wb_block<>;
   /// };

   /// struct guard_block_owner4 {
   ///     changeid4               gbo_change_id;
   ///     clientid4               gbo_client_id;
   /// };

   /// union write_block_guard4 switch (bool wbg_check) {
   ///     case TRUE:
   ///         guard_block_owner4  wbg_block_owner;
   ///     case FALSE:
   ///         void;
   /// };

   /// struct WRITE_BLOCK4args {
   ///     /* CURRENT_FH: file */
   ///     stateid4                wba_stateid;
   ///     offset4                 wba_offset;
   ///     stable_how4             wba_stable;
   ///     block_owner4            wba_owner;
   ///     uint32_t                wba_seq_id;
   ///     write_block_guard4      wba_guard;
   ///     write_block4            wba_data<>;
   /// };

RESULTS

/// struct WRITE_BLOCK4resok {
///     count4          wbr_count;
///     stable_how4     wbr_committed;
///     verifier4       wbr_writeverf;
///     block_owner4    wbr_owners<>;
/// };

/// union WRITE_BLOCK4res switch (nfsstat4 wbr_status) {
///     case NFS4_OK:
///         WRITE_BLOCK4resok   wbr_resok4;
///     default:
///         void;
/// };

DESCRIPTION

WRITE_BLOCK4 is WRITE4 (see Section 18.32 of ) with additional semantics over the block_owner and the activation of blocks. As such, all of the normal semantics of WRITE4 directly apply.

The main difference between the two operations is that WRITE_BLOCK4 works on blocks and not a raw data stream. As such, wba_offset is the starting block offset in the file and not the byte offset in the file. Some erasure encoding types can have different block sizes depending on the block type. Further, wbr_count is a count of written blocks and not written bytes.

If wba_stable is FILE_SYNC4, the data server MUST commit the written header and block data plus all file system metadata to stable storage before returning results. This corresponds to the NFSv2 protocol semantics. Any other behavior constitutes a protocol violation.

If wba_stable is DATA_SYNC4, then the data server MUST commit all of the header and block data to stable storage and enough of the metadata to retrieve the data before returning. The data server implementer is free to implement DATA_SYNC4 in the same fashion as FILE_SYNC4, but with a possible performance drop.

If wba_stable is UNSTABLE4, the data server is free to commit any part of the header and block data and the metadata to stable storage, including all or none, before returning a reply to the client. There is no guarantee whether or when any uncommitted data will subsequently be committed to stable storage. The only guarantees made by the data server are that it will not destroy any data without changing the value of wbr_writeverf and that it will not commit the data and metadata at a level less than that requested by the client.

The activation of header and block data interacts with the bo_activated field of each of the written blocks. If the data is not committed to stable storage, then the bo_activated field MUST NOT be set to true. Once the data is committed to stable storage, the data server can set the block's bo_activated if one of these conditions applies:

it is the first write to that block and the WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY flag is set
an ACTIVATE_BLOCK4 is issued later for that block
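The activation rule above can be sketched as a small predicate. This is only an illustration in Python. The function name may_set_activated and its parameters are hypothetical, and the sketch assumes the data server already knows whether the block has reached stable storage and whether this was the first write to it.

```python
WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY = 0x00000002  # value from the XDR above


def may_set_activated(on_stable_storage, first_write, wb_flags,
                      activate_issued=False):
    """Decide whether a data server may set bo_activated for one block.

    on_stable_storage: the header and block data have been committed.
    first_write:       this was the first write to the block.
    wb_flags:          the wb_flags value sent with the block.
    activate_issued:   an ACTIVATE_BLOCK4 was later issued for the block.
    """
    if not on_stable_storage:
        # Uncommitted data MUST NOT be reported as activated.
        return False
    if first_write and (wb_flags & WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY):
        return True
    return bool(activate_issued)
```

For example, a first write with the flag set that is still only in volatile storage (as can happen with UNSTABLE4) does not qualify, whereas the same write qualifies once it has been committed.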

There are subtle interactions with write holes caused by racing clients. One client could win the race in each case, but because it used a wba_stable of UNSTABLE4, the subsequent writes from the second client with a wba_stable of FILE_SYNC4 can be awarded activation, with bo_activated set to true for each of the blocks in the payload.

Finally, the interaction with wba_stable can mislead a client: having received a response in which bo_activated is false, it may believe that the blocks are not activated. A subsequent READ_BLOCK4 or READ_BLOCK_STATUS4 might show that bo_activated is true without any interaction by the client via ACTIVATE_BLOCK4.

Automatic setting of bo_activated to true on the first write should be a performance boost. But it can lead to the client having incorrect information (as above) and trying to ACTIVATE_BLOCK4 a payload that has lost the race. But is that bad? If you have racing clients, there is no guarantee at all as to the contents of the file.

Guarding the Write

In a guarded WRITE_BLOCK4, wba_guard.wbg_check is set, and the writing of a block MUST fail if the target block does not have both the same change_id as the gbo_change_id and the same client_id as the gbo_client_id.

This is useful in read-update-write scenarios. The client reads a block, updates it, and is prepared to write it back. It guards the write such that if another writer has modified the block, the data server will reject the modification.

Note that as the guard_block_owner4 (see ) does not have a block_id and the WRITE_BLOCK4 applies to all blocks in the range of wba_offset to the length of wba_data, each of the target blocks MUST have the same change_id and client_id. The client SHOULD present the smallest possible set of blocks to meet this requirement.

And the complexity goes up here. Does the DS reject only based on active blocks? Or can inactive ones also cause rejection? Is the DS supposed to vet all blocks first or proceed to the first error? Or do all blocks and return an array of errors? (This last one is a no-go.) Also, if we do the vet first, what happens if a WRITE_BLOCK4 comes in after the vetting? Are we to lock the file during this process? Even if we do that, we still have the issue of multiple DSes.
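One possible reading of the guard check is sketched below in Python. It assumes that the data server vets every target block before writing any of them, which is one of the choices this section leaves open. It also ignores the question of active versus inactive blocks. The BlockHeader and Guard classes and the guard_allows_write function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BlockHeader:
    """Hypothetical view of the owner fields stored with a block."""
    change_id: int
    client_id: int


@dataclass(frozen=True)
class Guard:
    """Mirrors guard_block_owner4 from the XDR above."""
    gbo_change_id: int
    gbo_client_id: int


def guard_allows_write(guard: Optional[Guard],
                       targets: Sequence[BlockHeader]) -> bool:
    """Return True if the write may proceed.

    guard is None when wbg_check is FALSE (an unguarded write).
    Otherwise every target block must carry the guard's change_id
    and client_id, since the guard has no per-block identifier.
    """
    if guard is None:
        return True
    return all(
        b.change_id == guard.gbo_change_id
        and b.client_id == guard.gbo_client_id
        for b in targets
    )
```

Under this reading, a single block modified by another writer causes the whole WRITE_BLOCK4 to be rejected, which is why presenting the smallest possible set of blocks matters.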

Updating the Header Only

Some erasure encoding types keep their blocks in plain text and have parity blocks in order to provide integrity. A common configuration for Reed-Solomon is 4 active blocks, 2 parity blocks, and 2 spares. Assuming 4kB data blocks, each payload delivers 16kB of data and 8kB of parity. If the application modifies the first data block, then all that needs to change is the first active block and the two parity blocks in the payload. In other approaches, only 12kB of the total 24kB has to be written to storage.

If that is attempted in the Flexible Files Version 2 Layout Type, then the payload will be deemed inconsistent. The reason for this is that the change_id for the unmodified blocks will not match those of the modified blocks.

The WRITE_BLOCK_FLAGS_UPDATE_HEADER_ONLY flag in wb_flags can be used to save the transmission of the blocks. If it is set, then wb_block MUST be empty and is ignored. Note that the client MUST modify only the wb_crc and the wba_owner.bo_change_id fields in this case. The wb_crc MUST change because the wba_owner.bo_change_id has been modified (see ).

For the purpose of computing the activation state of the block, the data server MUST treat this as an overwrite. Thus, in the response, bo_activated MUST be false.
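The arithmetic of the example above can be checked with a short Python sketch. It assumes that 4kB means 4096 bytes, that spares are not written, and that each block of the payload is described by a (wb_flags, wb_block length) pair. The plan_payload helper is hypothetical and only counts bytes of block data, not headers.

```python
WRITE_BLOCK_FLAGS_UPDATE_HEADER_ONLY = 0x00000001  # value from the XDR above

BLOCK_SIZE = 4 * 1024    # assumed: 4kB blocks of 4096 bytes
DATA_BLOCKS = 4
PARITY_BLOCKS = 2


def plan_payload(modified_data_blocks):
    """Return a (wb_flags, wb_block length) pair for each block of a payload.

    Modified data blocks and all parity blocks carry their data.
    Unmodified data blocks are sent header-only with an empty wb_block.
    """
    plan = []
    for index in range(DATA_BLOCKS + PARITY_BLOCKS):
        is_parity = index >= DATA_BLOCKS
        if is_parity or index in modified_data_blocks:
            plan.append((0, BLOCK_SIZE))
        else:
            plan.append((WRITE_BLOCK_FLAGS_UPDATE_HEADER_ONLY, 0))
    return plan


plan = plan_payload({0})                      # only the first data block changed
bytes_sent = sum(length for _, length in plan)
full_payload = (DATA_BLOCKS + PARITY_BLOCKS) * BLOCK_SIZE
```

Here bytes_sent is 12288 (12kB) against a full_payload of 24576 (24kB), while all six blocks still receive a new header and so keep a consistent change_id.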

Extraction of XDR

This document contains the external data representation (XDR) description of the Flexible Files Version 2 Layout Type. The XDR description is embedded in this document in a way that makes it simple for the reader to extract into a ready-to-compile form. The reader can feed this document into the following shell script to produce the machine-readable XDR description of the new flags:

   #!/bin/sh
   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

That is, if the above script is stored in a file called 'extract.sh', and this document is in a file called 'spec.txt', then the reader can do:

   sh extract.sh < spec.txt > erasure_coding_prot.x

The effect of the script is to remove leading white space from each line, plus a sentinel sequence of '///'. XDR descriptions with the sentinel sequence are embedded throughout the document.

Note that the XDR code contained in this document depends on types from the NFSv4.2 nfs4_prot.x file (generated from ) and the Flexible Files Layout Type flexfiles.x file (generated from ). This includes both NFS types that end with a 4, such as offset4, length4, etc., as well as more generic types such as uint32_t and uint64_t.

While the XDR can be appended to that from , the various code snippets belong in their respective areas of that XDR.
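For readers who want to see what the extraction does step by step, the same pipeline can be approximated in Python. This is a sketch intended to mirror the grep and sed commands described in this section, not a replacement for them. The function name extract_xdr is hypothetical.

```python
import re


def extract_xdr(text):
    """Approximate the grep/sed extraction pipeline on a string.

    Keeps only lines that start with optional spaces followed by '///',
    strips that sentinel (and one following space), and turns lines
    holding only the sentinel into empty lines.
    """
    out = []
    for line in text.splitlines():
        if not re.match(r"^ *///", line):      # grep '^ *///'
            continue
        line = re.sub(r"^ */// ", "", line)    # sed 's?^ */// ??'
        line = re.sub(r"^ *///$", "", line)    # sed 's?^ *///$??'
        out.append(line)
    return "\n".join(out) + ("\n" if out else "")


sample = (
    "Some prose.\n"
    "   /// struct x {\n"
    "   ///\n"
    "   ///     int a;\n"
    "   /// };\n"
    "More prose /// not code\n"
)
extracted = extract_xdr(sample)
```

Note that the sentinel must begin the line (after optional spaces) for the line to be extracted, so XDR that has been joined onto a single line will not round-trip correctly.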

Security Considerations

This document has the same security considerations as both Flex Files Layout Type version 1 (see Section 15 of ) and NFSv4.2 (see Section 17 of ).

IANA Considerations

pNFS Layout Types Registry

 introduced the 'pNFS Layout Types Registry'; new layout type numbers in this registry need to be assigned by IANA. This document defines the protocol associated with an existing layout type number: LAYOUT4_FLEX_FILES_V2 (see ).

Layout Type Assignments

Layout Type Name	Value	RFC	How	Minor Versions
LAYOUT4_FLEX_FILES_V2	0x6	RFCTBD10	L	1

NFSv4 Recallable Object Types Registry

 also introduced the 'NFSv4 Recallable Object Types Registry'. This document defines new recallable objects for RCA4_TYPE_MASK_FFV2_LAYOUT_MIN and RCA4_TYPE_MASK_FFV2_LAYOUT_MAX (see ).

Recallable Object Type Assignments

Recallable Object Type Name	Value	RFC	How	Minor Versions
RCA4_TYPE_MASK_FFV2_LAYOUT_MIN	20	RFCTBD10	L	1
RCA4_TYPE_MASK_FFV2_LAYOUT_MAX	21	RFCTBD10	L	1

Flexible Files Version 2 Layout Type Erasure Encoding Type Registry

This document introduces the 'Flexible Files Version 2 Layout Type Erasure Encoding Type Registry'. This document defines the FFV2_ENCODING_MIRRORED type for Client-Side Mirroring (see ).

Flexible Files Version 2 Layout Type Erasure Encoding Type Assignments

Erasure Encoding Type Name	Value	RFC	How	Minor Versions
FFV2_ENCODING_MIRRORED	1	RFCTBD10	L	2

References

Normative References

Informative References

A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems

Acknowledgments

The following from Hammerspace were instrumental in driving Flex Files v2: David Flynn, Trond Myklebust, Tom Haynes, Didier Feron, Jean-Pierre Monchanin, Pierre Evenou, and Brian Pawlowski.

Christoph Hellwig was instrumental in making sure Flexible Files Version 2 Layout Type was applicable to more than one Erasure-Encoding Type.

RFC Editor Notes

[RFC Editor: prior to publishing this document as an RFC, please replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the RFC number of this document]