<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="exp" docName="draft-nof-requirement-00" ipr="trust200902">
  <front>
    <title abbrev="Abbreviated-Title">NVMe over Fabric Network
    Requirement</title>

    <author fullname="Liang Guo" initials="L" surname="Guo">
      <organization>CAICT</organization>

      <address>
        <postal>
          <street>No.52, Hua Yuan Bei Road, Haidian District,</street>

          <city>Beijing</city>

          <region>Beijing</region>

          <code>100191</code>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>guoliang1@caict.ac.cn</email>

        <uri/>
      </address>
    </author>

    <author fullname="Yi Feng" initials="Y" surname="Feng">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street>12 Chegongzhuang Street, Xicheng District</street>

          <city>Beijing</city>

          <region>Beijing</region>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>yangzhiyong@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Jizhuang Zhao" initials="J" surname="Zhao">
      <organization>China Telecom</organization>

      <address>
        <postal>
          <street>South District of Future Science and Technology in Beiqijia
          Town, Changping District</street>

          <city>Beijing</city>

          <region>Beijing</region>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>zhaojzh@chinatelecom.cn</email>

        <uri/>
      </address>
    </author>

    <author fullname="Lily Zhao" initials="L" surname="Zhao">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 3 Shangdi Information Road, Haidian District</street>

          <city>Beijing</city>

          <region>Beijing</region>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>Lily.zhao@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Haibo Wang" initials="H." surname="Wang">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code>100095</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>rainsword.wang@huawei.com</email>

        <uri/>
      </address>
    </author>

    <date day="7" month="March" year="2022"/>

    <workgroup>Network Working Group</workgroup>

    <keyword>Sample</keyword>

    <keyword>Draft</keyword>

    <abstract>
      <t>NVMe over Fabrics defines a common architecture that supports a range
      of storage networking fabrics for the NVMe block storage protocol, such
      as Ethernet, Fibre Channel, and InfiniBand. On Ethernet-based networks,
      RDMA or TCP can be used to transport NVMe, but the network management
      mechanisms are simple and fault detection is weak.</t>

      <t>This document describes the solution requirements for automatic
      device discovery, which improves usability, and for fast switchover,
      which improves reliability.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>For a long time, key storage applications with high performance
      requirements have mainly relied on FC networks. As transmission rates
      have increased, the storage medium has evolved from HDDs to solid-state
      storage, and the protocol has evolved from SATA to NVMe. The emergence
      of new NVMe technologies brings new opportunities. With the development
      of the NVMe protocol, its application scenario has been extended from
      PCIe to other fabrics, solving the problems of NVMe scalability and
      transmission distance. The block storage protocol uses NoF to replace
      SCSI, reducing the number of protocol interactions between application
      hosts and storage systems. The end-to-end NVMe protocol greatly
      improves performance.</t>

      <t>NoF fabrics include Ethernet, Fibre Channel, and InfiniBand.
      Comparing FC-NVMe to Ethernet- or InfiniBand-based alternatives
      generally takes into consideration the advantages and disadvantages of
      the networking technologies. Fibre Channel fabrics are noted for their
      lossless data transmission, predictable and consistent performance, and
      reliability, and large enterprises tend to favor FC storage for
      mission-critical workloads. But Fibre Channel requires special
      equipment and storage networking expertise to operate and can be more
      costly than Ethernet-based alternatives. Like FC, InfiniBand is a
      lossless network requiring special hardware. Ethernet-based NVMe
      storage products tend to be more plentiful than FC-NVMe-based options,
      and most storage startups focus on Ethernet-based NVMe. However, unlike
      FC, InfiniBand and Ethernet lack a discovery service that enables the
      automatic addition of nodes to the fabric. Also unlike FC, Ethernet
      switches have no zone management and do not notify hosts of device
      status changes. When a device fails, the host relies on the NVMe
      keep-alive (heartbeat) mechanism and takes tens of seconds to complete
      service switchover.</t>

      <t><figure>
          <artwork align="center"><![CDATA[   +--------------------------------------+    
   |          NVMe Host Software          |    
   +--------------------------------------+    
   +--------------------------------------+    
   |   Host Side Transport Abstraction    |    
   +--------------------------------------+    
                                               
      /\      /\      /\      /\      /\       
     /  \    /  \    /  \    /  \    /  \      
      FC      IB     RoCE    iWARP   TCP       
     \  /    \  /    \  /    \  /    \  /      
      \/      \/      \/      \/      \/       
                                               
   +--------------------------------------+    
   |Controller Side Transport Abstraction |    
   +--------------------------------------+    
   +--------------------------------------+    
   |          NVMe SubSystem              |    
   +--------------------------------------+    
]]></artwork>
        </figure></t>

      <t>This document describes the application scenarios and capability
      requirements for Ethernet-based NVMe implementing automatic device
      discovery, domain management, and fault notification similar to FC.</t>
    </section>

    <section anchor="Terminology" title="Terminology">
      <t>Ethernet-based NVMe: using RDMA or TCP to transport NVMe through
      Ethernet</t>

      <t>FC: Fibre Channel</t>

      <t>NVMe: Non-Volatile Memory Express</t>

      <t>NoF: NVMe over Fabrics</t>

      <t>CDC: Centralized Discovery Controller</t>
    </section>

    <section anchor="UseCase" title="Use Case">
      <t>An Ethernet-based NVMe over RDMA or TCP storage network is shown
      below. The network mainly includes three types of roles: initiators
      (referred to as hosts), switches, and targets (referred to as storage
      devices). Initiators and targets are also referred to as endpoint
      devices. Hosts and storage devices use Ethernet-based NVMe to transmit
      data over the network and provide high-performance storage
      services.</t>


      <t><figure>
          <artwork align="center"><![CDATA[                 +--+      +--+      +--+      +--+      
     Host        |H1|      |H2|      |H3|      |H4|      
  (Initiator)    +/-+      +-,+      +.-+      +/-+      
                  |         | '.   ,-`|         |        
                  |         |   `',   |         |        
                  |         | ,-`  '. |         |        
                +-\--+    +--`-+    +`'--+    +-\--+     
                | SW |    | SW |    | SW |    | SW |     
                +--,-+    +---,,    +,.--+    +-.--+     
                    `.          `'.,`         .`         
                      `.   _,-'`    ``'.,   .`           
   Internet           +--'`+            +`-`-+           
    Network           | SW |            | SW |           
                      +--,,+            +,.,-+           
                      .`   `'.,     ,.-``   ',           
                    .`         _,-'`          `.         
                +--`-+    +--'`+    `'---+    +-`'-+     
                | SW |    | SW |    | SW |    | SW |     
                +-.,-+    +-..-+    +-.,-+    +-_.-+     
                  | '.   ,-` |        | `.,   .' |       
                  |   `',    |        |    '.`   |       
                  | ,-`  '.  |        | ,-`  `', |       
    Storage      +-`+      `'\+      +-`+      +`'+      
    (Target)     |S1|      |S2|      |S3|      |S4|      
                 +--+      +--+      +--+      +--+      
]]></artwork>
        </figure></t>

      <t>Sub-Scenario 1: Initial Deployment</t>

      <t>During initial system deployment, hosts and storage devices are
      connected to the network separately, and to achieve high reliability,
      each host and storage device is connected to dual network planes
      simultaneously. The host can read and write data only after an NVMe
      connection is established between the host and the storage device. To
      establish an NVMe connection, the host needs to know the IP address of
      the storage device. However, because Ethernet-based NVMe lacks a
      discovery service and cannot detect the status of access devices, the
      storage IP addresses must be configured manually on each host. Manual
      configuration is complex and error-prone.</t>

      <t>Sub-Scenario 2: Expansion</t>

      <t>During expansion, hosts or storage devices need to be added. The
      problem is the same as that in Sub-Scenario 1: when a new host is
      attached to a storage device, the storage device must be configured
      manually, which is equally complex.</t>

      <t>Sub-Scenario 3: Storage Faults</t>

      <t>When a storage device fails during operation, no device proactively
      notifies the host of the fault status. With the Ethernet-based NVMe
      protocol, the host uses the NVMe keep-alive heartbeat to detect the
      status of the storage device. The heartbeat message interval is 5s, so
      it takes tens of seconds to determine that the storage device is faulty
      and perform service switchover using the multipath software. This
      exceeds the failure tolerance time of core applications. To meet
      customer experience and business reliability requirements,
      Ethernet-based NVMe needs to be enhanced to support automatic device
      discovery and fault status notification.</t>

      <t>In the CDC (Centralized Discovery Controller) solution being
      discussed in the NVMe organization, hosts and storage devices report
      device information to the CDC, and the CDC synchronizes the information
      to the hosts. The host then establishes an NVMe link, implementing
      automatic device discovery. However, the CDC solution does not cover
      fault status notification, so the system cannot report a device's fault
      status and the failover time is still long. For core services, the
      duration of service impact is critical.</t>
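      <t>As a rough, non-normative illustration of why proactive fault
      notification matters, the following sketch compares the two detection
      paths. The retry count, multipath switchover time, and notification
      latency are assumed values for illustration, not measured or specified
      ones; only the 5s heartbeat interval comes from the text above.</t>

      <figure>
          <artwork align="left"><![CDATA[
```python
# Rough, illustrative estimate of failover time (assumed values only).

KEEP_ALIVE_INTERVAL_S = 5.0   # NVMe heartbeat interval from the text
MISSED_BEATS_TO_FAIL = 3      # assumed: host declares failure after 3 misses
MULTIPATH_SWITCH_S = 2.0      # assumed: multipath software switchover time

# Heartbeat-based detection: the host waits through several missed intervals.
heartbeat_failover = (KEEP_ALIVE_INTERVAL_S * MISSED_BEATS_TO_FAIL
                      + MULTIPATH_SWITCH_S)

# Switch-driven notification: the access switch detects the fault and
# pushes the status to subscribed hosts (sub-second, per the requirement).
NOTIFY_LATENCY_S = 0.5        # assumed detection + push latency
notified_failover = NOTIFY_LATENCY_S + MULTIPATH_SWITCH_S

print(f"heartbeat-based: ~{heartbeat_failover:.1f}s")    # ~17.0s
print(f"notification-based: ~{notified_failover:.1f}s")  # ~2.5s
```
]]></artwork>
      </figure>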

      <t>The solution proposed in this document is similar to the FC-NVMe
      system. The switch functions as the manager of the entire network,
      manages device information and status, and synchronizes information
      between switches across the network. In addition, to isolate storage
      services securely, a concept similar to the zone on a Fibre Channel
      network is introduced: hosts and storage devices are planned into a
      zone, and NVMe links can be established only between the hosts and
      storage devices in the same zone. The detailed requirements for hosts,
      switches, and storage devices on the network are as follows.</t>

      <t>Automatic device discovery: when a storage device is connected to
      the network, the host can automatically discover the storage device and
      establish an NVMe connection.</t>
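      <t>The zone concept above can be sketched as a simple membership
      check; the zone plan and device names below are hypothetical and for
      illustration only.</t>

      <figure>
          <artwork align="left"><![CDATA[
```python
# Hypothetical sketch of FC-like zoning for Ethernet-based NVMe:
# an NVMe connection is allowed only if host and storage share a zone.

zones = {
    "zone-A": {"H1", "S1", "S2"},  # assumed zone plan
    "zone-B": {"H2", "S3"},
}

def may_connect(host: str, storage: str) -> bool:
    """Return True if some zone contains both endpoints."""
    return any(host in members and storage in members
               for members in zones.values())

print(may_connect("H1", "S2"))  # same zone -> True
print(may_connect("H1", "S3"))  # different zones -> False
```
]]></artwork>
      </figure>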

      <t><list style="numbers">
          <t>When a host is connected to the network, it sends its device
          access information to the switch periodically; likewise, when a
          storage device is connected to the network, it sends its device
          access information to the switch periodically.</t>

          <t>After receiving the host and storage access information, the
          switch synchronizes the information to other switches.</t>

          <t>The switch identifies objects in the same zone and synchronizes
          host and storage device information to objects in the zone.</t>

          <t>After receiving the storage device information provided by the
          switch, the host automatically establishes an NVMe connection.</t>
        </list><figure>
          <artwork align="center"><![CDATA[  +----+       +-------+     +-------+    +-------+   
  |Host|       |Storage|     |Switch |    |Switch |   
  +----+       +-------+     +-------+    +-------+   
     |             |            |             |       
     |-----1------>|            |             |       
     |             |-----1----->|             |       
     |             |            |-----2------>|       
     |             |            |             |       
     |             |            |<----2-------|       
     |             |            |             |       
     |<----3-------|-----3------|             |       
     |             |            |             |       
     |-----4------>|            |             |       
     |             |            |             |       
     |             |            |             |       
                                                      
]]></artwork>
        </figure>Fault detection: The host can detect the fault status of the
      storage device and quickly switch to the standby path.<list
          style="numbers">
          <t>The host subscribes to the storage status information from the
          switch.</t>

          <t>If a storage fault occurs, the access switch detects the fault at
          the storage network layer or link layer.</t>

          <t>The switch synchronizes the status to other switches on the
          network.</t>

          <t>The switch identifies the hosts that subscribe to the storage
          status in the zone and synchronizes the storage fault information to
          the hosts.</t>

          <t>The host quickly disconnects from the faulty storage device and
          triggers the multipathing software to switch services to the
          redundant path. The fault is detected within 1s.</t>
        </list><figure>
          <artwork align="center"><![CDATA[   +----+       +-------+     +-------+    +-------+ 
   |Host|       |Storage|     |Switch |    |Switch | 
   +----+       +-------+     +-------+    +-------+ 
      |             |            |             |     
      |-----1-------|----------->|             |     
      |             |            |-+           |     
      |             |            |2|           |     
      |             |            |-+           |     
      |             |            |             |     
      |             |            |<----3-------|     
      |<----4-------|------------|             |     
      |             |            |             |     
      |-----4------>|            |             |     
      |             |            |             |     
      |             |            |             |     
]]></artwork>
        </figure></t>
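      <t>As a non-normative sketch, the registration, synchronization,
      subscription, and notification steps above might be modeled as
      follows. All class names, identifiers, and data structures are
      hypothetical and illustrative only; they are not a protocol
      definition.</t>

      <figure>
          <artwork align="left"><![CDATA[
```python
# Hypothetical sketch of the switch-managed discovery and fault-notification
# flows described above; names and data structures are illustrative only.

class Switch:
    def __init__(self):
        self.registry = {}       # device id -> {"role": ..., "addr": ...}
        self.peers = []          # other switches to synchronize with
        self.subscribers = {}    # storage id -> set of subscribed host ids
        self.notified = []       # (host, storage, status) records

    # Discovery step 1: endpoints periodically report access information.
    def register(self, dev_id, role, addr, propagate=True):
        self.registry[dev_id] = {"role": role, "addr": addr}
        if propagate:
            # Discovery step 2: synchronize to other switches.
            for peer in self.peers:
                peer.register(dev_id, role, addr, propagate=False)

    # Fault step 1: a host subscribes to a storage device's status.
    def subscribe(self, host_id, storage_id):
        self.subscribers.setdefault(storage_id, set()).add(host_id)

    # Fault steps 2-4: detect, synchronize, and notify subscribed hosts.
    def report_fault(self, storage_id, propagate=True):
        for host_id in self.subscribers.get(storage_id, set()):
            self.notified.append((host_id, storage_id, "faulty"))
        if propagate:
            for peer in self.peers:
                peer.report_fault(storage_id, propagate=False)

sw1, sw2 = Switch(), Switch()
sw1.peers, sw2.peers = [sw2], [sw1]

sw1.register("H1", "host", "10.0.0.1")      # host access info -> switch
sw2.register("S1", "storage", "10.0.1.1")   # storage access info -> switch
sw1.subscribe("H1", "S1")                   # host subscribes to storage status

sw2.report_fault("S1")                      # access switch detects the fault
print(sw1.notified)          # [('H1', 'S1', 'faulty')]
print("S1" in sw1.registry)  # synchronized across switches -> True
```
]]></artwork>
      </figure>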
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References">
      <reference anchor="ODCC-2020-05016">
        <front>
          <title>NVMe over RoCEv2 Network Control Optimization Technical
          Requirements and Test Specifications</title>

          <author fullname="" surname="">
            <organization>Open Data Center Committee</organization>
          </author>

          <date year="2020"/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
