<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-yy-hpwan-transport-gap-analysis-00"
     ipr="trust200902">
  <front>
    <title abbrev="HP-WAN Transport Gap Analysis">Gap Analysis of
    Transport Protocols for High Performance Wide Area Networks</title>

    <author fullname="Kehan Yao" initials="K." surname="Yao">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>yaokehan@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Hongwei Yang" initials="H." surname="Yang">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>yanghongwei@chinamobile.com</email>
      </address>
    </author>

    <date day="15" month="October" year="2024"/>

    <area>Web and Internet Transport Area</area>

    <workgroup>High Performance Wide Area Network</workgroup>

    <keyword>Data movement</keyword>

    <keyword>RDMA</keyword>

    <abstract>
      <t>This document analyzes the throughput performance of existing
      transport protocols under different implementation modes, including
      kernel-space-based, user-space-based, and offloading-based
      implementations. It concludes that existing technologies are either
      limited by host CPU overhead or by the complexity of offloading, and
      cannot guarantee high throughput approaching the line rate of the
      network adapter. Accordingly, this document proposes new requirements
      for the design of HP-WAN transport protocols.</t>
    </abstract>

  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>HP-WAN needs to support massive data transmission over
      long-distance, lossy, and shared wide-area network infrastructure,
      with data transmission volumes typically in the terabyte (TB) to
      petabyte (PB) range; throughput is the key performance indicator. The
      design of transport protocols is critical, including rate control,
      congestion control, and multi-stream processing. Transport protocols
      such as TCP and QUIC can improve end-to-end data transmission
      throughput to some extent through optimizations of congestion control
      and multi-path transmission algorithms. However, they inevitably
      introduce excessive CPU overhead, so the actual throughput cannot
      approach the bandwidth of the endpoint network adapters, and
      operational costs increase.</t>

      <t>Super-fast Ethernet has become the industry trend: commercial
      products are available at 400 Gbps and are evolving towards 800 Gbps
      and Tbit-level bandwidth. Meanwhile, the performance growth rate of
      CPUs has slowed, and the gap between the two growth rates is becoming
      more obvious. The efficiency with which the endpoint CPU is used for
      data transmission is therefore increasingly important for improving
      endpoint throughput.</t>

      <t>This document analyzes the throughput performance of existing
      transport protocols under different implementation modes, including
      kernel-space-based, user-space-based, and offloading-based
      implementations. It concludes that existing technologies are either
      limited by host CPU overhead or by the complexity of offloading, and
      cannot guarantee high throughput approaching the line rate of the
      network adapter. Accordingly, this document proposes new requirements
      for the design of HP-WAN transport protocols.</t>
    </section>

    <section anchor="definition-of-terms" title="Definition of Terms">
      <t>This document makes use of the following terms:</t>

      <t><list hangIndent="2" style="hanging">
          <t hangText="TOE:">Transport Offload Engine. A set of techniques
          that offload part of the TCP/IP protocol stack or TCP processing
          to network adapters.</t>

          <t hangText="RDMA:">Remote Direct Memory Access. A technology
          that bypasses the CPU to access the memory of a remote network
          endpoint directly.</t>

          <t hangText="RoCEv2:">RDMA over Converged Ethernet version 2. The
          second version of RoCE, which carries the RDMA transport layer
          originating from InfiniBand over the UDP/IP stack.</t>

          <t hangText="iWARP:">Internet Wide Area RDMA Protocol. A protocol
          suite that comprises several layers and protocols to realize RDMA
          functionality over the TCP/IP stack.</t>
        </list></t>

      <t>Even though this document is not a protocol specification, it makes
      use of upper case key words to define requirements unambiguously. The
      key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
      "OPTIONAL" in this document are to be interpreted as described in BCP 14
      <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when,
      they appear in all capitals, as shown here.</t>
    </section>

    <section title="Gap Analysis of Existing Transport Solutions for HP-WAN">
      <t>There are three main ways to implement the TCP/IP protocol stack:
      running in user space, running in kernel space, and offloading-based
      solutions. Offloading-based methods can be further divided into
      partial offloading to assist the kernel and RDMA. Both running fully
      in kernel space and running in user space lead to very high CPU
      overhead and suffer from throughput performance bottlenecks. Partial
      offloading methods, such as TOE, can reduce CPU overhead to some
      extent, but still face throughput performance bottlenecks. Existing
      RDMA implementations either require support from intermediate network
      nodes or have high offloading complexity, and still cannot meet the
      performance requirements.</t>

      <section title="TCP/IP Stack Running in Kernel Space">
        <t>TCP and other mainstream transport protocols primarily run in
        the kernel space of operating systems such as Linux, and QUIC,
        although usually implemented in user space, relies on the kernel
        UDP stack. After long-term community version iterations and
        maintenance, these protocols have excellent reliability and
        scalability, such that they can meet the performance requirements
        of most Internet applications, such as multimedia and web
        applications. However, the core requirement of HP-WAN services is
        extremely high data throughput. Although the throughput of these
        kernel-based protocols has improved with the maturation of
        Bottleneck Bandwidth and Round-trip propagation time (BBR) <xref
        target="I-D.cardwell-ccwg-bbr"/>, the CPU overhead is still very
        high.</t>

        <t>The main reasons why these kernel-based transport protocols
        consume high CPU resources are:</t>

        <t><list style="symbols">
            <t>Frequent copies between kernel space and user space,
            especially on the receiver side, lead to excessive CPU resource
            consumption.</t>

            <t>Interrupt handling occupies many CPU cores, and interrupt
            processing contends with the processing of the TCP/IP protocol
            stack.</t>

            <t>Multi-stream transmission can improve the total throughput
            to a certain extent, but in general each stream needs to be
            bound to a CPU core, and multiple streams compete for CPU
            resources, resulting in CPU load imbalance that affects the
            actual throughput.</t>

            <t>In modern non-uniform memory access (NUMA) architectures,
            communication between CPU cores incurs significant
            overhead.</t>

            <t>Processing control messages consumes significant CPU
            resources.</t>
          </list></t>

        <t>Implementing the complete TCP/IP transport protocol stack in the
        kernel space of an operating system inevitably introduces the above
        problems. Even though these problems are outside the scope of IETF
        specifications, they affect the choice of technologies and the
        protocol design, especially when throughput performance metrics are
        critical, and they should therefore be considered in the transport
        protocol design space.</t>
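        <t>As an illustration of the copy overhead noted in the first
        bullet above, the following minimal C sketch shows one existing
        sender-side mitigation: the Linux MSG_ZEROCOPY mechanism, which
        lets the kernel transmit directly from pinned user pages. The
        send_zerocopy helper is hypothetical, while SO_ZEROCOPY and
        MSG_ZEROCOPY are existing Linux socket options; error handling is
        simplified, and this does not remove the receive-side copy that
        dominates HP-WAN workloads.</t>

        <figure>
          <artwork><![CDATA[
#include <stddef.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60           /* available since Linux 4.14 */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Send one buffer without copying it into kernel space.  The kernel
 * pins the user pages and reports completion on the socket error
 * queue, so the buffer must stay unmodified until that completion. */
int send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;               /* kernel too old: fall back to copy */
    if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
        return -1;
    return 0;
}
]]></artwork>
        </figure>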
      </section>

      <section title="TCP/IP Stack Running in User Space">
        <t>The advantage of running the complete TCP/IP protocol stack in
        user space is that it can effectively reduce the memory copy
        overhead between user space and kernel space, and even achieve zero
        copy, which reduces processing latency. However, it still
        introduces CPU overhead. The main reasons are the inability of
        threads and processes to share sockets and the overhead of
        multi-threaded locks. For example, when the main process listens
        for new connections on a socket, the system needs to start a child
        process to handle each request, and the child process needs to
        access the newly connected socket, while at the same time the
        parent process keeps listening on the socket for new connections.
        Socket contention between parent and child processes and the lock
        maintenance in multi-threaded applications are both detrimental to
        throughput performance, as the sketch below illustrates.</t>
      </section>

      <section title="Offloading TCP/IP Stack to Network Adapters">
        <section title="Transport Offload Engine (TOE)">
          <t>TOE is a partial offloading technology that offloads part of
          the TCP/IP stack to the hardware of the network adapter, while
          some of the protocol stack still runs in the kernel space of the
          operating system. This technique can reduce CPU overhead to some
          extent. Many mature TOE technologies have been adopted by the
          industry, such as generic segmentation offload (GSO), large
          receive offload (LRO), receive side scaling (RSS), and checksum
          offloading. However, TOE technology has a performance bottleneck.
          For example, Appendix A lists the measured throughput of the TCP
          and QUIC protocols under different congestion control algorithms,
          such as BBR and CUBIC <xref target="RFC9438"/>, with TOE
          enabled.</t>
        </section>
        </section>

        <section title="RDMA">
          <t>RDMA is an important technology for realizing CPU bypass, and
          there are standardized technologies applied in data center
          networks and wide area networks, such as RoCEv2 and iWARP, with a
          history of about 20 years. However, current RDMA technology has
          some limitations. RoCEv2 is oriented to data center networks and
          requires lossless behavior from the network; RoCEv2 is not
          defined by the IETF. iWARP is designed for data center networks,
          local area networks, and wide area networks. Because it runs over
          TCP, iWARP is reliable without intermediate network support,
          which is aligned with the design goal of an HP-WAN transport
          protocol, but iWARP still has performance limitations.</t>

          <section title="RoCEv2">
            <t>RoCEv2 is defined by the InfiniBand Trade Association
            (IBTA). It adapts the RDMA transport layer originally defined
            for InfiniBand to run over the UDP/IP protocol stack, and it is
            well aligned with the development trend of the Ethernet
            ecosystem. For RoCEv2 to guarantee high-throughput data
            transmission, the network needs to provide congestion
            notification mechanisms such as explicit congestion
            notification (ECN) and lossless transmission capabilities such
            as the priority-based flow control (PFC) mechanism of data
            center bridging (DCB). In short-distance data center networks,
            it is feasible to guarantee such capabilities, but in the
            Internet, providing such capabilities, especially lossless
            operation, is extremely expensive. Therefore, native RoCEv2 is
            not suitable for high-throughput data transmission in wide-area
            lossy environments.</t>
          </section>

          <section title="iWARP">
            <t>The iWARP protocol suite was defined in the IETF, and it is
            based on the TCP/IP stack, so it provides reliability by
            itself. It consists of three main layers and protocols to
            achieve RDMA functionality over data center networks, local
            area networks, and wide area networks. The first is the RDMA
            semantic layer, where the RDMA read, write, and send semantics
            are defined; RDMAP <xref target="RFC5040">RFC 5040</xref> is in
            this layer. The second is the RDMA core functionality layer,
            where message segmentation and reassembly, buffer models, and
            direct data placement with in-order message delivery are
            defined; RDDP <xref target="RFC5041">RFC 5041</xref> is in this
            layer. The third is the boundary marking layer, where data
            framing and integrity and Framed Protocol Data Unit (FPDU)
            alignment are defined; MPA <xref target="RFC5044">RFC
            5044</xref> is in this layer.</t>

            <t>Three offloading modes of iWARP are defined in <xref
            target="SC07"/>: host-based, host-offloaded, and host-assisted.
            The host-based mode offloads only the TCP stack to the network
            adapter. The host-offloaded mode offloads the RDDP, MPA, and
            TCP stacks to the network adapter. The host-assisted mode
            retains RDMAP and the Markers function of the MPA layer in the
            host; only RDDP, CRC, and the TCP stack are offloaded. There is
            an obvious performance gap among the three modes. More details
            on the measured performance are given in Appendix A. The
            conclusion is that none of the three implementations can
            guarantee high throughput, and the bandwidth of the network
            adapter cannot be fully utilized. This is perhaps part of the
            reason why iWARP has limited industry support and few
            implementations.</t>

            <figure align="center" title="Three Offloading Modes of iWARP">
              <artwork type="ascii-art"> +----------------+ +----------------+ +------------------+  
 | +-----++----+  | |    +-----+     | | +-----++-------+ |  
 | |RDMAP||RDDP|  | |    |RDMAP|     | | |RDMAP||Markers| |  
 | +-----++----+  | |    +-----+     | | +-----++-------+ |  
 | +---++-------+ | +-------+--------+ +--------+---------+  
 | |CRC||Markers| |         |                   |            
 | +---++-------+ |         |                   |        HOST
 +-------+--------+         |                   |            
---------+------------------+-------------------+------------
         |                  |                   |            
 +-------v--------+ +-------v--------+   +------v-------+    
 |    +-------+   | |+----++-------+ |   | +----+ +---+ |    
 |    |TCP/IP |   | ||RDDP||Markers| |   | |RDDP| |CRC| |    
 |    +-------+   | |+----++-------+ |   | +----+ +---+ | NIC
 +----------------+ |+---+ +-------+ |   |   +------+   |    
                    ||CRC| |TCP/IP | |   |   |TCP/IP|   |    
                    |+---+ +-------+ |   |   +------+   |    
                    +----------------+   +--------------+    </artwork>
            </figure>
          </section>
        </section>
      </section>
    </section>

    <section title="Requirements for New HP-WAN Mechanisms">
      <t>Based on the analysis above, existing transport solutions
      introduce high CPU cost or high offloading complexity, which results
      in performance bottlenecks. These solutions cannot evolve to satisfy
      HP-WAN high-throughput performance requirements in the Tbit-level
      Ethernet era. Therefore, the following new requirements are
      proposed.</t>

      <section title="Support RDMA">
        <t>For applications with ultra-high throughput transmission
        requirements, RDMA MUST be supported to reduce CPU overhead. It is
        RECOMMENDED that RDMA establish a connection with a single
        handshake and tear down the connection with a single handshake at
        the semantic level. The write operation MUST be supported, and the
        read operation MAY be supported.</t>
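        <t>As a point of reference, the following C sketch posts a
        one-sided RDMA WRITE through the existing Verbs interface. It is
        not a specification of the new protocol; it only shows the write
        semantics that MUST be supported. The post_rdma_write helper is
        hypothetical, while ibv_post_send and the ibv_* structures are
        existing Verbs constructs; the queue pair, memory registration,
        remote address, and rkey are assumed to have been exchanged during
        connection setup, and error handling is omitted.</t>

        <figure>
          <artwork><![CDATA[
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a one-sided RDMA WRITE: the local buffer is placed directly
 * into remote memory without involving the remote CPU. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode              = IBV_WR_RDMA_WRITE,
        .sg_list             = &sge,
        .num_sge             = 1,
        .send_flags          = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad_wr = NULL;

    return ibv_post_send(qp, &wr, &bad_wr);
}
]]></artwork>
        </figure>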
      </section>

      <section title="Lightweight Transport Layer">
        <t>The iWARP protocol suite can fully offload the TCP/IP transport
        layer, but due to its space and time complexity, its performance
        still has a bottleneck, especially in throughput. Therefore, RDMA
        cannot be based on a complete TCP transport protocol
        implementation, and a lightweight transport protocol must be
        designed. On the other hand, the transport protocol should still
        provide reliability and congestion management functions so that it
        does not depend on any auxiliary capabilities provided by the
        network layer.</t>

        <section title="Congestion Control">
          <t>In terms of congestion control, the mechanism for congestion
          detection is a key factor in the algorithm. Delay, bandwidth, and
          packet loss are all important signals for determining congestion,
          but they have their own advantages and disadvantages in terms of
          false positive rate, fairness, and algorithm complexity.
          Therefore, when designing congestion control mechanisms, the
          following requirements can be considered:</t>

          <t><list style="symbols">
              <t>TCP's slow start and congestion avoidance mechanisms MUST
              be improved to enhance bandwidth utilization.</t>

              <t>It is RECOMMENDED to transform the congestion window
              mechanism into a time-interval-based sending rate control
              mechanism, which is easier to implement on a network adapter,
              as sketched below.</t>
            </list></t>
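          <t>The following C sketch illustrates time-interval-based rate
          control: instead of a congestion window, the sender derives a
          per-packet pacing gap from a target rate, which maps naturally
          onto a hardware pacing timer. The function and parameter names
          are illustrative assumptions, not part of any specification.</t>

          <figure>
            <artwork><![CDATA[
#include <stdint.h>

/* Time between packet transmissions, in nanoseconds, for a given
 * target rate: time per packet = bits per packet / bits per second. */
static inline uint64_t pacing_interval_ns(uint64_t target_rate_bps,
                                          uint32_t pkt_len_bytes)
{
    return (uint64_t)pkt_len_bytes * 8u * 1000000000ull
           / target_rate_bps;
}

/* Example: 1500-byte packets at 100 Gbit/s give a 120 ns gap.  The
 * congestion controller only updates target_rate_bps; the network
 * adapter enforces the gap without per-packet CPU involvement. */
]]></artwork>
          </figure>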
        </section>

        <section title="Reliability">
          <t>TCP's rate control relies on the sliding window mechanism,
          which ensures reliability but also limits the maximum sending
          capability of the sender to some extent. TCP's ability to handle
          out-of-order reception SHOULD be retained, so that data packets
          are stored as soon as they arrive, reducing resource occupancy.
          For handling lost packets, a precise retransmission mechanism
          MUST be designed, with retransmission notifications for lost
          packets only and support for cumulative acknowledgement, reducing
          the impact of packet loss and transmission distance on throughput
          performance.</t>
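          <t>A minimal C sketch of such a receiver follows: packets are
          stored as they arrive, a cumulative acknowledgement point
          advances over the contiguous prefix, and any gap below the
          highest sequence number seen identifies exactly which packets to
          request again. The structure and names are illustrative
          assumptions.</t>

          <figure>
            <artwork><![CDATA[
#include <stdbool.h>
#include <stdint.h>

#define RX_WINDOW 1024  /* receive window, in packets (assumption) */

struct rx_state {
    uint64_t cum_ack;          /* all packets < cum_ack received  */
    bool     got[RX_WINDOW];   /* bitmap for the window above it  */
};

/* Accept a packet in any order, then advance the cumulative ack
 * over the contiguous prefix.  Sequence numbers still missing
 * below the highest one seen form the precise retransmission
 * (NACK) set. */
void on_packet(struct rx_state *s, uint64_t seq)
{
    if (seq < s->cum_ack || seq >= s->cum_ack + RX_WINDOW)
        return;                       /* duplicate or out of window */
    s->got[seq % RX_WINDOW] = true;
    while (s->got[s->cum_ack % RX_WINDOW]) {
        s->got[s->cum_ack % RX_WINDOW] = false;
        s->cum_ack++;                 /* cumulative acknowledgement */
    }
}
]]></artwork>
          </figure>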
        </section>

        <section title="Multi-path Transmission">
          <t>Multi-path transmission SHOULD be supported. Within a single
          connection, multiple streams are maintained in parallel, and the
          streams are independent: congestion control parameters can be set
          separately for each stream, as sketched below.</t>
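          <t>The following C sketch shows one possible per-connection
          layout in which each stream carries its own congestion controller
          state, so loss or delay on one stream does not throttle the
          others. All names are illustrative assumptions.</t>

          <figure>
            <artwork><![CDATA[
#include <stdint.h>

#define MAX_STREAMS 16  /* per-connection stream limit (assumption) */

struct stream_cc {              /* independent per-stream state */
    uint64_t rate_bps;          /* current pacing rate           */
    uint64_t rtt_ns;            /* latest RTT sample             */
    uint32_t loss_events;       /* per-stream loss accounting    */
};

struct connection {
    uint32_t         n_streams;
    struct stream_cc cc[MAX_STREAMS];
};
]]></artwork>
          </figure>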
        </section>

        <section title="Other Requirements">
          <t>In addition, security MUST be considered in most cases.
          Packets SHOULD be designed with a high payload ratio, fully
          considering the complexity of offloading. At the same time, the
          above components SHOULD be designed modularly; that is,
          congestion control, lost packet retransmission, and data
          encryption can be turned on and off on demand.</t>
        </section>
      </section>

      <section title="Application Developer Friendly Interfaces">
        <t>The new mechanism MUST NOT change the programming interfaces of
        mainstream applications, such as Verbs, Libfabric, and Socket. At
        present, RDMA mainly supports the Verbs and Libfabric interfaces,
        and big data transmission applications need to develop adaptations
        for these interfaces. If distributed applications that require huge
        amounts of data movement are based on the Socket interface in the
        future, the development of the network adapter driver MUST be
        oriented to the Socket interface. However, the RDMA and Socket
        abstractions do not match, API compatibility is poor, and it is
        necessary to support a standardized programming interface, as the
        sketch below illustrates.</t>
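        <t>The following C sketch outlines why a Socket-compatible shim
        over RDMA is non-trivial: a Socket-style send() accepts any user
        buffer at any time, while Verbs-style RDMA requires buffers to be
        registered with the network adapter first. The hpwan_send name and
        the steps in the comments are hypothetical.</t>

        <figure>
          <artwork><![CDATA[
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical Socket-compatible entry point over an RDMA engine. */
ssize_t hpwan_send(int fd, const void *buf, size_t len)
{
    /* 1. Look up the RDMA queue pair bound to this fd.             */
    /* 2. Ensure [buf, buf + len) is covered by a registered memory */
    /*    region, registering it on the fly or copying into a       */
    /*    pre-registered bounce buffer (which reintroduces a copy). */
    /* 3. Post an RDMA operation and wait for, or batch, the        */
    /*    completion so that Socket "data accepted" semantics hold. */
    (void)fd; (void)buf;
    return (ssize_t)len;            /* placeholder result */
}
]]></artwork>
        </figure>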
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>

    <section title="Acknowledgements">
      <t>The authors would like to thank other team members from China
      Mobile for their contributions: Guangyu Zhao, Shiping Xu, Zongpeng
      Du, and Zhiqiang Li.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.5040"?>

      <?rfc include="reference.RFC.5041"?>

      <?rfc include="reference.RFC.5044"?>

      <?rfc include="reference.RFC.8174"?>

      <?rfc include="reference.RFC.9438"?>
    </references>

    <references title="Informative References">
      <?rfc include="reference.I-D.cardwell-ccwg-bbr"?>

      <reference anchor="SC07">
        <front>
          <title>Analyzing the Impact of Supporting Out-of-Order
          Communication on In-Order Performance with iWARP</title>

          <author fullname="Pavan Balaji" initials="P." surname="Balaji"/>

          <date year="2007"/>
        </front>

        <seriesInfo name="SC '07"
                    value="Proceedings of the 2007 ACM/IEEE Conference on Supercomputing"/>
      </reference>
    </references>

    <section title="Performance Test Results">
      <section title="TCP Performance Test with TOE Enabled">
        <t>TCP was tested in the following environment: CPU: Intel Xeon
        Gold 6248R, 4x24=96 cores, 3.0 GHz clock frequency; network
        adapter: 100 Gbps NVIDIA ConnectX-5, PCIe 3.0 x16; operating
        system: CentOS 8.</t>

        <t>The experiment was performed in a laboratory environment, and
        single-stream and multi-stream throughput were tested under
        different congestion control algorithms. BBRv1 and CUBIC were
        tested under 0.1% and 1% packet loss conditions. The throughput of
        the BBR algorithm can reach more than 10 Gbps with a single stream
        and peaks at more than 80 Gbps with 25 streams sharing the network.
        The throughput of BBR is much higher than that of CUBIC, because
        CUBIC is very sensitive to packet loss. Nevertheless, the
        performance of BBR still cannot reach the bandwidth ceiling.</t>

        <t>The reasons are that, on the one hand, the receiver must bear
        the additional overhead introduced by packet loss detection and
        packet loss recovery; on the other hand, TOE's optimization effect
        degrades. These effects cause the CPU utilization to approach its
        bottleneck.</t>

        <figure align="center"
                title="TCP Throughput Performance, RTT=70ms, MTU=1500">
          <artwork type="ascii-art">+---------+------------+----------+------------+----------+
|         | TCP+BBRv1  |TCP+BBRv1 | TCP+CUBIC  |TCP+CUBIC | 
|         |0.1%Pkt loss|1%Pkt loss|0.1%Pkt loss|1%Pkt loss|
+---------+------------+----------+------------+----------+
| Single  |   14Gbps   |  10Gbps  |   8.6Mbps  |   Null   |
| Stream  |            |          |            |          |
+---------+------------+----------+------------+----------+
|    3    |   41Gbps   | 24.5Gbps |   28Mbps   |   Null   |
| Streams |            |          |            |          |
+---------+------------+----------+------------+----------+ 
|   10    |   70Gbps   |  61Gbps  |   91Mbps   |   Null   |  
| Streams |            |          |            |          |     
+---------+------------+----------+------------+----------+ 
|   25    |   84Gbps   | 84.7Gbps |    Null    |   Null   |
| Streams |            |          |            |          |   
+---------+------------+----------+------------+----------+</artwork>
        </figure>
      </section>

      <section title="QUIC Performance Test with TOE Enabled">
        <t>The QUIC performance test was performed in the following
        environment: CPU: Intel Xeon Gold 6248 @ 2.5 GHz, 20 cores each;
        network adapter: NVIDIA ConnectX-5, dual-port 100G, PCIe 3.0 x16.
        The test ran over the China Mobile backbone network (CMNET) from
        Harbin to Guiyang, a distance of over 3000 kilometers with an
        average RTT of around 65 ms. The average packet loss rate is
        between 0.01% and 0.1%.</t>

        <t>BBRv1 was selected as the congestion control algorithm, and the
        test results were obtained with CPU utilization approaching
        saturation. In the 40-core test case, when the number of streams
        exceeds 40, the throughput stays at about 50 Gbps. In the 80-core
        test case, when the number of streams exceeds 80, the throughput
        stays at about 63 Gbps. Under both test conditions, there is a
        large gap between the theoretical 100 Gbps bandwidth limit of the
        network adapter and the real data transmission throughput. The
        results confirm that CPU overhead plays a pivotal role in endpoint
        data transmission throughput.</t>

        <figure align="center"
                title="QUIC Throughput Performance, RTT=65ms, MTU=1500">
          <artwork type="ascii-art">+---------+------------+------------+
|         | QUIC+BBRv1 | QUIC+BBRv1 |
|         |0.1%Pkt loss|0.1%Pkt loss|
|         |  40 cores  |  80 cores  |
+---------+------------+------------+
|   40    |   47.2Gbps | 52.8Gbps   |
| Streams |            |            |
+---------+------------+------------+
|   60    |   42.4Gbps | 57.2Gbps   |
| Streams |            |            |
+---------+------------+------------+
|   80    |   51.2Gbps | 62.4Gbps   |
| Streams |            |            |
+---------+------------+------------+
|   100   |   NULL     | 63.2Gbps   |
| Streams |            |            |
+---------+------------+------------+</artwork>
        </figure>
      </section>

      <section title="iWARP Performance (Referenced from SC07 Paper)">
        <t>Test environment: Chelsio 10GE NIC, two Intel Xeon 3.0 GHz
        processors, Red Hat OS.</t>

        <t>The figure shows that the three offloading modes differ greatly
        in actual network adapter throughput and host CPU utilization. When
        processing small 128-byte messages, the CPU utilization of the
        three modes is relatively low, but so is the throughput. When
        processing larger 256 KB messages, although the CPU utilization of
        the host-offloaded mode is very low and stays around 10%, its
        throughput is only 3.5 Gbps, about 1/3 of the network adapter
        bandwidth, so its throughput performance is not good. The best
        throughput is achieved in the host-assisted mode, which is
        partially offloaded, but even that is only about 60% of the network
        adapter bandwidth, with CPU utilization reaching 80%. The
        host-based mode performs worse in terms of both bandwidth and CPU
        utilization.</t>

        <t>The above results show that there are performance bottlenecks in
        all three implementations of iWARP; none can maintain throughput
        close to the physical bandwidth of the network adapter with low CPU
        consumption. One of the main influencing factors is the Markers
        function in the MPA layer. Markers play an important role in
        marking the record boundaries between the TCP and RDDP protocols.
        Offloading Markers to network adapters requires too many state
        maintenance resources, which affects the actual throughput, while
        implementing the Markers function inside the host makes the CPU
        usage too high. Therefore, although TCP-based iWARP has good
        reliability and scalability, its performance is still limited and
        needs further improvement.</t>

        <figure align="center"
                title="Performance Comparison of Three Offloading Modes of iWARP">
          <artwork type="ascii-art">+---------+----------+----------+----------+
|         | MSG size | MSG size | MSG size |
|         |   128B   |   1KB    |  256KB   |
|         +-----+----+-----+----+-----+----+
|         | BW  |CPU | BW  |CPU | BW  |CPU |
|         |Gbps |Util|Gbps |Util|Gbps |Util|
+---------+-----+----+-----+----+-----+----+
|  host-  | 0.1 |12% | 0.8 |40% | 1.8 |75% |
|  based  |     |    |     |    |     |    |
+---------+-----+----+-----+----+-----+----+
|  host-  | 1.0 |10% | 3.5 |10% | 3.5 |10% |
|offloaded|     |    |     |    |     |    |
+---------+-----+----+-----+----+-----+----+
|  host-  | 0.5 |18% | 3.5 |50% | 5.8 |80% |
| assisted|     |    |     |    |     |    |
+---------+-----+----+-----+----+-----+----+</artwork>
        </figure>
        </figure>
      </section>
    </section>
  </back>
</rfc>
