<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-guo-nof-requirement-01" ipr="trust200902">
  <front>
    <title abbrev="Fast Fault Detection for IP SANs">Requirement of Fast Fault Detection for
    IP-based SANs</title>

    <author fullname="Liang Guo" initials="L" surname="Guo">
      <organization>CAICT</organization>

      <address>
        <postal>
          <street>No.52, Hua Yuan Bei Road, Haidian District,</street>

          <city>Beijing</city>

          <region>Beijing</region>

          <code>100191</code>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>guoliang1@caict.ac.cn</email>

        <uri/>
      </address>
    </author>

    <author fullname="Yi Feng" initials="Y" surname="Feng">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street>12 Chegongzhuang Street, Xicheng District</street>

          <city>Beijing</city>

          <region>Beijing</region>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>fengyiit@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Jizhuang Zhao" initials="J" surname="Zhao">
      <organization>China Telecom</organization>

      <address>
        <postal>
          <street>South District of Future Science and Technology in Beiqijia
          Town, Changping District</street>

          <city>Beijing</city>

          <region>Beijing</region>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>zhaojzh@chinatelecom.cn</email>

        <uri/>
      </address>
    </author>

    <author fullname="Fengwei Qin" initials="F" surname="Qin">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street>12 Chegongzhuang Street, Xicheng District</street>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>qinfengwei@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Lily Zhao" initials="L" surname="Zhao">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 3 Shangdi Information Road, Haidian District</street>

          <city>Beijing</city>

          <region>Beijing</region>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>Lily.zhao@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Haibo Wang" initials="H." surname="Wang">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code>100095</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>rainsword.wang@huawei.com</email>

        <uri/>
      </address>
    </author>

    <date day="11" month="July" year="2022"/>

    <workgroup>Network Working Group</workgroup>

    <abstract>
      <t>NVMe over Fabrics (NoF) defines a common architecture for carrying
      the NVMe block storage protocol over a range of storage networking
      fabrics, such as Ethernet, Fibre Channel, and InfiniBand. In IP-based
      networks, RDMA or TCP can be used to transport NVMe, but network fault
      detection is weak.</t>

      <t>This document describes the requirements for a fast fault detection
      solution that improves reliability.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>For a long time, key storage applications with high performance
      requirements have mainly been built on Fibre Channel (FC) networks. As
      transmission rates increased, the storage medium evolved from HDDs to
      solid-state storage, and the storage protocol evolved from SATA to
      NVMe. The emergence of new NVMe technologies brings new opportunities.
      NVMe over Fabrics (NoF) extends the NVMe protocol from PCIe to other
      fabrics, which addresses the limits of NVMe on scaling and transmission
      distance. For block storage, NoF replaces SCSI and reduces the number
      of protocol interactions between application hosts and storage
      systems. Running the NVMe protocol end to end greatly improves
      performance.</t>

      <t>The fabrics supported by NoF include Ethernet, Fibre Channel, and
      InfiniBand. A comparison of FC-NVMe with Ethernet-based or
      InfiniBand-based alternatives generally weighs the advantages and
      disadvantages of the underlying networking technologies. Fibre Channel
      fabrics are noted for their lossless data transmission, predictable and
      consistent performance, and reliability. Large enterprises tend to
      favor FC storage for mission-critical workloads. However, Fibre Channel
      requires special equipment and storage networking expertise to operate
      and can be more costly than IP-based alternatives. Like FC, InfiniBand
      is a lossless network that requires special hardware. IP-based NVMe
      storage products tend to be more plentiful than FC-NVMe-based options,
      and most storage startups focus on IP-based NVMe. Unlike FC, however,
      an Ethernet switch does not notify endpoints of changes in device
      status. When a device fails, the host has to rely on the NVMe link
      heartbeat mechanism and takes tens of seconds to complete service
      failover.</t>

      <t><figure>
          <artwork align="center"><![CDATA[   +--------------------------------------+    
   |          NVMe Host Software          |    
   +--------------------------------------+    
   +--------------------------------------+    
   |   Host Side Transport Abstraction    |    
   +--------------------------------------+    
                                               
      /\      /\      /\      /\      /\       
     /  \    /  \    /  \    /  \    /  \      
      FC      IB     RoCE    iWARP   TCP       
     \  /    \  /    \  /    \  /    \  /      
      \/      \/      \/      \/      \/       
                                               
   +--------------------------------------+    
   |Controller Side Transport Abstraction |    
   +--------------------------------------+    
   +--------------------------------------+    
   |          NVMe SubSystem              |    
   +--------------------------------------+    
]]></artwork>
        </figure></t>

      <t>This document describes the application scenarios and capability
      requirements for IP-based NVMe to implement fast fault detection
      similar to that of FC. The proposal is already under discussion in a
      working group of the NVMe organization.</t>
    </section>

    <section anchor="Security" title="Terminology">
      <t>IP-based NVMe: NVMe transported over RDMA or TCP on an Ethernet
      network</t>

      <t>FC: Fibre Channel</t>

      <t>NVMe: Non-Volatile Memory Express</t>

      <t>NoF: NVMe over Fabrics</t>
    </section>

    <section anchor="Acknowledgements" title="Use Case">
      <t>A typical IP-based storage network that carries NVMe over RDMA or
      TCP is shown in the following figure. The network includes three main
      roles: an initiator (referred to as a host), a switch, and a target
      (referred to as a storage device). Initiators and targets are together
      referred to as endpoint devices.</t>


      <t><figure>
          <artwork align="center"><![CDATA[                 +--+      +--+      +--+      +--+      
     Host        |H1|      |H2|      |H3|      |H4|      
  (Initiator)    +/-+      +-,+      +.-+      +/-+      
                  |         | '.   ,-`|         |        
                  |         |   `',   |         |        
                  |         | ,-`  '. |         |        
                +-\--+    +--`-+    +`'--+    +-\--+     
                | SW |    | SW |    | SW |    | SW |     
                +--,-+    +---,,    +,.--+    +-.--+     
                    `.          `'.,`         .`         
                      `.   _,-'`    ``'.,   .`           
         IP           +--'`+            +`-`-+           
    Network           | SW |            | SW |           
                      +--,,+            +,.,-+           
                      .`   `'.,     ,.-``   ',           
                    .`         _,-'`          `.         
                +--`-+    +--'`+    `'---+    +-`'-+     
                | SW |    | SW |    | SW |    | SW |     
                +-.,-+    +-..-+    +-.,-+    +-_.-+     
                  | '.   ,-` |        | `.,   .' |       
                  |   `',    |        |    '.`   |       
                  | ,-`  '.  |        | ,-`  `', |       
    Storage      +-`+      `'\+      +-`+      +`'+      
    (Target)     |S1|      |S2|      |S3|      |S4|      
                 +--+      +--+      +--+      +--+      
]]></artwork>
        </figure></t>

      <t>Hosts and storage devices are connected to the network separately.
      To achieve high reliability, each host and each storage device is
      connected to two network planes at the same time. Once an NVMe
      connection is established between a host and a storage device, the
      host can read and write data on that device.</t>

      <t>When a storage device link fails during operation, the host cannot
      detect the fault at the transport layer because the storage device is
      not directly connected to it. With IP-based NVMe, the host instead uses
      the NVMe heartbeat to detect the status of the storage device. The
      heartbeat interval is 5 seconds, so it takes tens of seconds to
      determine that the storage device is faulty and to perform service
      switchover using the multipath software. This exceeds the failure
      tolerance time of core applications. To meet customer experience and
      business reliability requirements, fault detection and failover for
      IP-based NVMe need to be enhanced.</t>
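
      <t>The arithmetic behind the "tens of seconds" figure can be sketched
      as follows. This is a minimal model and not part of any NVMe
      specification: it assumes that the host sends one heartbeat per
      interval and declares the storage device faulty after a fixed number
      of consecutive heartbeats go unanswered. Only the 5-second interval
      comes from this document; the function name and the limit of three
      missed heartbeats are assumptions chosen for illustration.<figure>
          <artwork><![CDATA[
```python
# Illustrative only: the 5 s interval comes from the text above;
# missed_limit is an assumed value, not an NVMe constant.

def heartbeat_detection_window(interval_s, missed_limit):
    """Return (best, worst) seconds from fault to detection.

    Model: the host sends a heartbeat every interval_s seconds
    and judges a heartbeat as missed if it is still unanswered
    one interval after it was sent. The target is declared
    faulty after missed_limit consecutive misses.
    """
    # Best case: the fault occurs just before a heartbeat is sent.
    best = missed_limit * interval_s
    # Worst case: the fault occurs just after a heartbeat was
    # answered, so almost one extra interval passes first.
    worst = (missed_limit + 1) * interval_s
    return best, worst


if __name__ == "__main__":
    best, worst = heartbeat_detection_window(5.0, 3)
    print(f"detection takes {best:.0f} s to {worst:.0f} s")
```
]]></artwork>
        </figure></t>

      <t>Under these assumptions, detection alone takes between 15 and 20
      seconds, and the multipath switchover time is added on top of
      that.</t>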

      <t>This document proposes a fast fault detection solution in which
      switches participate. The solution uses the ability of switches to
      detect faults quickly at the physical layer and link layer. A switch
      that detects a fault synchronizes the fault information across the IP
      network, and the switches then notify the endpoint devices of the
      fault status.</t>

      <t>Fault detection procedure: the following steps allow the host to
      detect a storage device fault and quickly switch to the standby
      path.<list
          style="numbers">
          <t>When a storage fault occurs, the access switch detects the fault
          at the physical layer or link layer.</t>

          <t>The access switch synchronizes the fault status to the other
          switches on the network.</t>

          <t>The switches notify the hosts of the storage fault.</t>

          <t>Each host quickly tears down its connection to the faulty
          storage device and triggers the multipathing software to switch
          services to the redundant path. The fault is detected within 1
          second.</t>
        </list><figure>
          <artwork align="center"><![CDATA[   +----+       +-------+     +-------+    +-------+ 
   |Host|       |Switch |     |Switch |    |Storage| 
   +----+       +-------+     +-------+    +-------+ 
      |             |            |-+           |     
      |             |            |1|           |     
      |             |            |-+           |     
      |             |<----2------|             |     
      |             |            |             |     
      |<----3-------|            |             |     
      |             |            |             |     
      |<----4-------|------------|-----------> |     
      |             |            |             |     
     
]]></artwork>
        </figure></t>
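
      <t>The four steps above can be sketched as a small in-memory model.
      This is an illustration under stated assumptions, not a protocol
      definition: the class names, method names, and path identifiers are
      all hypothetical, and the sketch does not model how notifications are
      encoded or carried across the IP network.<figure>
          <artwork><![CDATA[
```python
# Illustrative model of the four-step procedure. Every class,
# method, and identifier here is hypothetical.

class Host:
    def __init__(self, name, paths):
        self.name = name
        # Ordered storage port identifiers, active path first.
        self.paths = list(paths)
        self.active = self.paths[0]

    def on_fault_notification(self, port):
        # Step 4: drop the connection over the faulty path and
        # let the multipath layer pick the next remaining path.
        if port in self.paths:
            self.paths.remove(port)
        if self.active == port:
            self.active = self.paths[0] if self.paths else None


class Switch:
    def __init__(self, name):
        self.name = name
        self.peers = []    # other switches in the IP network
        self.hosts = []    # directly attached hosts
        self.seen = set()  # faults already processed

    def on_link_down(self, port):
        # Step 1: the access switch detects the fault locally.
        self.handle_fault(port)

    def handle_fault(self, port):
        if port in self.seen:
            return         # suppress duplicate notifications
        self.seen.add(port)
        # Step 2: synchronize the status to the other switches.
        for peer in self.peers:
            peer.handle_fault(port)
        # Step 3: notify the attached hosts.
        for host in self.hosts:
            host.on_fault_notification(port)


if __name__ == "__main__":
    host = Host("H1", ["S1-portA", "S1-portB"])
    host_sw, storage_sw = Switch("SW-host"), Switch("SW-storage")
    host_sw.peers.append(storage_sw)
    storage_sw.peers.append(host_sw)
    host_sw.hosts.append(host)
    storage_sw.on_link_down("S1-portA")
    print(host.active)  # prints "S1-portB"
```
]]></artwork>
        </figure></t>

      <t>The set of already processed faults keeps a notification from
      looping between switches. That choice belongs to this sketch only; a
      real design could suppress duplicates in other ways.</t>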
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References">
      <reference anchor="ODCC-2020-05016">
        <front>
          <title>NVMe over RoCEv2 Network Control Optimization Technical
          Requirements and Test Specifications</title>

          <author fullname="" surname="">
            <organization>Open Data Center Committee</organization>
          </author>

          <date year="2020"/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
