<?xml version="1.0" encoding="US-ASCII"?>
<!-- This is built from a template for a generic Internet Draft. Suggestions for
     improvement welcome - write to Brian Carpenter, brian.e.carpenter @ gmail.com 
     This can be converted using the Web service at http://xml.resource.org/ -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<!-- You want a table of contents -->
<!-- Use symbolic labels for references -->
<!-- This sorts the references -->
<!-- Change to "yes" if someone has disclosed IPR for the draft -->
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate -->
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-yao-alto-core-level-load-balancing-00"
     ipr="trust200902">
  <front>
    <title abbrev="Application-Layer Traffic Optimization">A Load-Aware
    Core-Level Load Balancing Framework</title>

    <author fullname="Kehan Yao" initials="K." surname="Yao">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>yaokehan@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Zhiqiang Li" initials="Z." surname="Li">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>lizhiqiangyjy@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Tian Pan" initials="T." surname="Pan">
      <organization>Beijing University of Posts and
      Telecommunications</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100876</code>

          <country>China</country>
        </postal>

        <email>pan@bupt.edu.cn</email>
      </address>
    </author>

    <author fullname="Yan Zou" initials="Y." surname="Zou">
      <organization>Beijing University of Posts and
      Telecommunications</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100876</code>

          <country>China</country>
        </postal>

        <email>zouyan@bupt.edu.cn</email>
      </address>
    </author>

    <date day="13" month="March" year="2023"/>

    <area>Transport</area>

    <workgroup>Application-Layer Traffic Optimization</workgroup>

    <keyword>framework</keyword>

    <keyword>load balancing</keyword>

    <abstract>
      <t>Most existing literature on load balancing in data center
      networks (DCNs) focuses on balancing traffic between servers or
      links; relatively little attention has been given to balancing
      traffic across the CPU cores of individual servers. This draft
      presents a load balancing framework for DCNs that is aware of
      core-level load. Specifically, each server reports the real-time
      load of its CPU cores to an L4 load balancer, which then uses this
      information to deliver packets to lightly loaded cores.
      Theoretically, our approach can completely avoid core-level load
      imbalance, making the system more stable and enabling higher CPU
      utilization without overprovisioning.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>Current load balancing strategies in data centers primarily focus
      on link balancing, with the goal of distributing traffic evenly across
      parallel paths to improve network link utilization and prevent network
      congestion. Methods such as ECMP <xref target="RFC2991"/> and WCMP
      <xref target="WCMP"/> use hashing to distribute traffic across
      different paths. However, these methods do not consider core-level
      server load balancing, which can lead to actual load imbalances in
      data center networks where heavy-hitter flows coexist with low-rate
      flows. Some existing works estimate the load of each server from its
      traffic, but this has two issues. First, load estimation may be
      biased, causing traffic to be assigned to heavily loaded servers.
      Second, these methods lack granularity at the CPU core level, which
      may result in single-core overload within a server. In this draft, we
      address these issues by transmitting CPU load to the load balancer,
      which thereby obtains global load information and can allocate traffic
      to servers at core-level granularity.</t>

      <t>Our solution faces challenges on both sides. On the server side,
      the first challenge is to smoothly track the rapidly changing load on
      each core. The second is to insert load information into data packets
      and transmit it to the load balancer along the shortest path.
      Finally, reaching the load balancer may require multi-hop routing, so
      keeping the load information timely becomes a critical issue. On the
      load balancer side, the first challenge is to allocate processing
      cores to flows based on the reported load. The second is to
      accurately deliver the marked application flows to the designated
      cores. Finally, the load balancer needs to ensure flow consistency
      while minimizing extreme single-core pressure.</t>
    </section>

    <section title="Conventions Used in This Document">
      <section title="Terminology">
        <t>CID: Core Identifier</t>

        <t>CPU: Central Processing Unit</t>

        <t>LB: Load Balancer</t>

        <t>NIC: Network Interface Card</t>
      </section>

      <section title="Requirements Language">
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
        "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
        "OPTIONAL" in this document are to be interpreted as described in
        BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and
        only when, they appear in all capitals, as shown here.</t>
      </section>
    </section>

    <section title="Framework Overview">
      <t>We present the overall framework of our design, as shown in Figure
      1. Each server sends the load of its internal cores to the load
      balancer; we use smoothed CPU utilization as the measure of load. The
      server encapsulates the load into a load-aware packet and passes it
      to the load balancer via a layer-2 or layer-3 mechanism. The load
      balancer can be an x86-based server running a virtual machine or a
      switch based on a programmable ASIC. The load balancer parses the
      load information carried in the load-aware packet and maintains the k
      server-CPU pairs with the lowest load using a minimum-heap data
      structure. New connections are assigned to these k server-CPU pairs
      by the load balancer in round-robin fashion. To ensure flow
      consistency, the load balancer keeps a Redirect table that records
      per-flow state. Packets that hit a table entry are forwarded directly
      to the corresponding CPU.</t>

      <figure align="center" title="Load-aware Core-level LB Framework">
        <artwork type="ascii-art">                +----------+-----+-----+        +--------------------+
      Flow      |  Flow ID | DIP | CID |  miss  | Minimum heap with  |
--------------&gt; +----------+-----+-----+ ------&gt;|k least-loaded cores|
                |          |     |     |        |                    |
                +----------+-----+-----+        +----------+---------+
                     Redirect table                        |
                           |                  Select core  |
                           v  hit             for IP flow  |
                +----------------------+&lt;------------------+
     data packet|   L4 Load-balancer   |data packet
     +----------+                      +----------+
     |          |     (Tofino/x86)     |          |
     | +-------&gt;+--+-^-----------^--+--+&lt;------+  |
     | |           | | Load-aware|  |          |  |
     | |           | |   packet  |  |          |  |
  +--v-+---+    +--v-+---+    +--+--v--+    +--+--v--+
  | Rack1  |    | Rack2  |    | Rack3  |    | Rack4  |
  | +----+ |    | +----+ |    | +----+ |    | +----+ |
  | +----+ |    | +----+ |    | +----+ |    | +----+ |
  |        |    |        |    |        |    |        |
  | +----+ |    | +----+ |    | +----+ |    | +----+ |
  | +----+ |    | +----+ |    | +----+ |    | +----+ |
  |        |    |        |    |        |    |        |
  +--------+    +--------+    +--------+    +--------+</artwork>
      </figure>
    </section>

    <section title="Server side design">
      <t>To smooth the historical load of each core on the server side, we
      apply exponential smoothing to each CPU utilization sample:</t>

      <t>Load_n = alpha * Load_get + (1 - alpha) * Load_n-1</t>

      <t>where Load_get is the load value sampled this time, Load_n-1 is
      the result of the previous calculation, and alpha is the smoothing
      parameter. The formula yields a smoothed load value Load_n that
      represents the current CPU load.</t>
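      <t>As an illustration, the smoothing update can be sketched in
      Python as follows; the alpha value of 0.2 in the example is an
      assumption, not a recommendation:</t>

      <figure>
        <artwork type="code">
```python
def smooth_load(load_get, load_prev, alpha=0.2):
    """Exponentially smooth one CPU utilization sample.

    load_get:  utilization sampled this interval (0.0 to 1.0)
    load_prev: smoothed value from the previous interval
    alpha:     smoothing parameter (illustrative choice)
    """
    return alpha * load_get + (1 - alpha) * load_prev

# Example: previous smoothed load 0.50, new sample 0.90
load_n = smooth_load(0.90, 0.50, alpha=0.2)  # 0.2*0.9 + 0.8*0.5 = 0.58
```
        </artwork>
      </figure>

      <t>A larger alpha reacts faster to load spikes; a smaller alpha
      suppresses transient bursts more strongly.</t>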

      <t>When congestion occurs in the network, the core load information
      carried by a packet may become stale. Therefore, the server records
      in each packet the timestamp at which the packet was sent. When the
      packet arrives, the load balancer calculates its transmission delay.
      If the delay is too large, the load balancer will not select that
      server as a target for traffic delivery (assuming there is only one
      path from the LB to the server), because the path is already
      congested. In that case, no traffic is assigned to that server,
      regardless of whether its cores are overloaded.</t>
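      <t>A minimal sketch of this freshness check, assuming synchronized
      clocks and a 500-microsecond delay threshold (both assumptions of
      the example):</t>

      <figure>
        <artwork type="code">
```python
def is_load_report_fresh(sent_ts_ns, now_ns, max_delay_ns=500_000):
    """Accept a load report only if its one-way delay stays below a
    threshold; a large delay signals congestion on the (single)
    LB-to-server path, so the server is skipped entirely.
    max_delay_ns (500 us here) is an illustrative value."""
    return max_delay_ns >= now_ns - sent_ts_ns
```
        </artwork>
      </figure>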

      <t>When designing the packet structure that carries load
      information, it is not necessary to record the load of all cores;
      only the n cores with the lowest load need to be recorded. We have
      designed two solutions for different scenarios. First, when there is
      a fixed path between the load balancer and the server, source
      routing can be used to send the packets to the load balancer. The
      packet structure is as follows: after the Ethernet header we add the
      SR field; at each switch on the path, one SR entry is popped from
      the stack until the packet arrives at the load balancer. The packet
      also carries the source IP, CPU ID, CPU load, and timestamp.</t>

      <figure align="center" title="Message Format">
        <artwork type="ascii-art">+-+-+-+-+-+-+-+-+-+-+-+~ 48 bits~-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Destination Mac                               |
+-+-+-+-+-+-+-+-+-+-+-+~ 48 bits~-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Source Mac                                   |
+-+~ 16 bits~-+-+~ 8 bits~-+-+~ 8 bits~-+-+~ 8 bits~-+-+~ 8 bits~-+
|     Type    |    SR1     |  SR2       |    SR3     |    SR4     |
+-+-+-+-+~ 32 bits~ -+-+-+-+-+-+--+-+~ 8 bits~-+-+-+~ 8 bits~-+-+-+
|        Source IP                |   CPU ID   |    CPU Load      |
+-+-+-+-+-+-+-+-+-+-+-+-+~ 48 bits~-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Timestamp                                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+</artwork>
      </figure>
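      <t>The format in Figure 2 could be serialized as follows (a sketch:
      the EtherType value 0x88B5 in the Type field, the fixed 4-entry SR
      stack, and nanosecond timestamps are assumptions of the
      example):</t>

      <figure>
        <artwork type="code">
```python
import struct

def build_load_packet(dst_mac, src_mac, sr_path, src_ip,
                      cpu_id, cpu_load, ts_ns):
    """Serialize the load-aware packet of Figure 2: 48-bit MACs,
    16-bit Type, four 8-bit SR entries, 32-bit source IP, 8-bit
    CPU ID, 8-bit CPU load, 48-bit timestamp (30 bytes total)."""
    header = struct.pack("!6s6sH4B4sBB",
                         dst_mac, src_mac,
                         0x88B5,  # assumed local/experimental EtherType
                         sr_path[0], sr_path[1], sr_path[2], sr_path[3],
                         src_ip, cpu_id, cpu_load)
    ts48 = struct.pack("!Q", ts_ns)[2:]  # low 48 bits, big-endian
    return header + ts48
```
        </artwork>
      </figure>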

      <t>Second, when the path between the load balancer and the server is
      not fixed and spans multiple hops, the core ID and core load can be
      inserted into an IPv6 packet that is routed to the load balancer. We
      mark the packet as a load notification packet in the IPv6 Traffic
      Class field, fill the 8-bit CPU ID and 8-bit core load into the Flow
      Label field, and then route it to the load balancer.</t>
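      <t>A sketch of packing the two 8-bit fields into the 20-bit Flow
      Label; placing the CPU ID in the higher byte and leaving the top 4
      bits zero is an assumption of the example:</t>

      <figure>
        <artwork type="code">
```python
def pack_flow_label(cpu_id, cpu_load):
    """Pack the 8-bit CPU ID into bits 15..8 and the 8-bit load into
    bits 7..0 of the 20-bit IPv6 Flow Label (top 4 bits zero)."""
    if cpu_id > 255 or cpu_load > 255:
        raise ValueError("both fields are 8 bits wide")
    return cpu_id * 256 + cpu_load  # cpu_id in the high byte

def unpack_flow_label(label):
    """Recover (cpu_id, cpu_load) from a Flow Label value."""
    return (label // 256) % 256, label % 256
```
        </artwork>
      </figure>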
    </section>

    <section title="LB side design">
      <t>The load balancer needs to parse the load notification packets
      from the servers, extract the load of each core in each server, and
      maintain an internal minimum heap. The load information collected
      within each period is sifted through the heap to obtain the k cores
      with the lowest load. Until the next minimum heap is generated, new
      flows are allocated to these k least-loaded cores in round-robin
      fashion.</t>
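      <t>The selection of the k least-loaded cores and the round-robin
      assignment of new flows can be sketched as follows (server
      addresses and the heap-rebuild period are illustrative and outside
      the sketch):</t>

      <figure>
        <artwork type="code">
```python
import heapq
import itertools

def k_least_loaded(core_loads, k):
    """core_loads maps (server_ip, core_id) to its smoothed load.
    Return the k pairs with the lowest load (heap-based selection)."""
    return heapq.nsmallest(k, core_loads, key=core_loads.get)

def assign_new_flows(flows, targets):
    """Round-robin assignment of new flows to the selected
    (server, core) targets until the heap is rebuilt."""
    rr = itertools.cycle(targets)
    return {flow: next(rr) for flow in flows}
```
        </artwork>
      </figure>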

      <t>Ordinary L4 load balancers map packets destined for a service
      with a virtual IP address (VIP) to a pool of servers with direct IP
      addresses (the DIP pool). In our solution, the load balancer needs
      to be accurate at the core level. Therefore, we select the DIP and
      delivery core directly for the first packet of a flow and record the
      flow identifier (Flow ID), direct IP address (DIP), and core
      identifier (CID) in a table to ensure flow consistency. In extreme
      situations, such as a single heavy-hitter flow putting too much
      pressure on a core, the CPU load can be reduced by applying back
      pressure to that flow.</t>
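      <t>A minimal sketch of the Redirect table logic; the names are
      illustrative, and pick_target stands in for the k-least-loaded
      selection:</t>

      <figure>
        <artwork type="code">
```python
class RedirectTable:
    """Per-flow state: flow_id maps to a (dip, cid) pair. A hit
    forwards directly to the recorded core; a miss installs an entry
    for the first packet, preserving flow consistency."""

    def __init__(self):
        self.entries = {}

    def lookup_or_install(self, flow_id, pick_target):
        target = self.entries.get(flow_id)
        if target is None:          # miss: first packet of the flow
            target = pick_target()  # (dip, cid) from the min-heap
            self.entries[flow_id] = target
        return target
```
        </artwork>
      </figure>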

      <t>We need to mark each packet with the target core in the
      destination server. For the IP address, we can overwrite the
      original VIP with the DIP. For the core ID, in order not to
      overwrite the original contents of the user packet, we construct a
      new header and insert it into the user packet before transmitting it
      to the server NIC. The NIC parses the destination core ID, removes
      the added header, and delivers the restored user packet to the
      designated core.</t>
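      <t>A sketch of the header insertion and removal; the 2-byte shim
      (one version byte plus the 8-bit core ID) is an assumed layout,
      since this draft does not specify the header format:</t>

      <figure>
        <artwork type="code">
```python
def encapsulate_core_id(user_packet, cid):
    """LB side: prepend an assumed 2-byte shim consisting of a
    version byte (0x01) and the 8-bit destination core ID."""
    return bytes([0x01, cid]) + user_packet

def decapsulate_core_id(packet):
    """NIC side: read the core ID, strip the shim, and return the
    restored user packet for delivery to that core."""
    return packet[1], packet[2:]
```
        </artwork>
      </figure>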
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <reference anchor="WCMP">
        <front>
          <title>WCMP: Weighted cost multipathing for improved fairness in
          data centers</title>

          <author fullname="Junlan Zhou" initials="J." surname="Zhou"/>

          <author fullname="Malveeka Tewari" initials="M." surname="Tewari"/>

          <author fullname="Min Zhu" initials="M." surname="Zhu"/>

          <author fullname="Abdul Kabbani" initials="A." surname="Kabbani"/>

          <author fullname="Leon Poutievski" initials="L."
                  surname="Poutievski"/>

          <author fullname="Arjun Singh" initials="A." surname="Singh"/>

          <author fullname="Amin Vahdat" initials="A." surname="Vahdat"/>

          <date month="April" year="2014"/>
        </front>
      </reference>

      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.2991"?>

      <?rfc include="reference.RFC.8174"?>
    </references>
  </back>
</rfc>
