<?xml version="1.0" encoding="US-ASCII"?>
<!-- This is built from a template for a generic Internet Draft. Suggestions for
     improvement welcome - write to Brian Carpenter, brian.e.carpenter @ gmail.com 
     This can be converted using the Web service at http://xml.resource.org/ -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-liu-ops-cco-cm-requirement-00"
     ipr="trust200902">
  <front>
    <title abbrev="CCO Control and Management Requirements">Requirements
    from the Control and Management Viewpoint for Collective Communication
    Optimization</title>

    <author fullname="Chang Liu" initials="C." surname="Liu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>liuchangjc@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Shiping Xu" initials="S." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>xushiping@chinamobile.com</email>
      </address>
    </author>


    <date day="23" month="October" year="2023"/>

    <area>Operations and Management</area>

    <workgroup>Operations and Management Area Working Group</workgroup>

    <keyword>collective communication</keyword>

    <keyword>in-network computing</keyword>

    <abstract>
      <t>Collective communication optimization is a key means of improving
      the performance of distributed applications, because communication has
      become the bottleneck that degrades applications and business as
      distributed systems grow in scale. Industry and academia have been
      working on solutions to improve collective communication operations;
      however, unified guidelines have been lacking.</t>

      <t>This document provides requirements from the control and management
      viewpoint.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>In recent years, with the development and evolution of various
      applications and businesses, and especially the rapid growth of AI
      applications, distributed computing performance has become increasingly
      important and has gradually become a key factor restricting the growth
      of these applications. Since collective communication is the primary
      communication mode of current distributed computing systems, its
      performance is crucial. However, many problems remain to be solved
      before collective communication performance can improve. On the one
      hand, many collective communication operations implemented by
      message-level communication libraries such as MPI and NCCL depend
      mainly on unicast point-to-point communication, leading to redundant
      data transmission, underutilization of network resources, and waste of
      network capabilities. On the other hand, since the underlying network
      protocols and collective communication are not co-designed, there is a
      semantic gap between inter-process message transport and packet
      forwarding. There is therefore great scope for optimizing collective
      communication, and industry and academia are actively promoting the
      development, implementation, and deployment of such optimizations.</t>

      <t>The Computing in the Network (COIN) research group also focuses on
      this topic. Its goal is to investigate how network data-plane
      programmability can improve the Internet architecture, with a broad
      scope that includes network function offloading, machine learning
      acceleration, in-network caching, in-network control, and more. Besides
      the collective operation offloading that COIN discusses for collective
      communication optimization, substituting multicast for point-to-point
      unicast, scheduling tasks and planning transport paths with topology
      awareness, and bridging the semantic gap between inter-process message
      transport and packet forwarding can also help optimize collective
      communication.</t>

      <t>This document provides requirements from the network control and
      management viewpoint, covering the optimization approaches of
      collective communication offloading, multicast mechanisms, topology
      awareness, and bridging the semantic gap between inter-process message
      transport and packet forwarding, in order to guide standardization work
      on collective communication optimization.</t>
    </section>

    <section title="Requirements">
      <section title="Memory Management">
        <t>The scarce memory resources provided by network devices for
        collective communication MUST be scheduled and controlled, e.g., by
        assigning a scheduling priority to collective communication
        offloading tasks. The memory that network devices such as
        programmable switches can provide for collective communication is
        severely mismatched with, and extremely scarce relative to, the
        volume of collective communication messages in applications such as
        AI, big data systems, and HPC.</t>

        <t>Use Case <xref target="ESA"/>. The memory of a programmable
        switch is scarce relative to the volume of gradients transmitted in
        distributed training. Existing work, such as pool-based streaming
        and dynamic sharing, addresses this problem but is not yet
        sufficient. One way to fully utilize the switch memory is for the
        control and management module of the switch to assign a priority to
        each aggregation task and to dynamically and preemptively schedule
        the aggregation tasks in the data plane, making fuller use of the
        memory organized as switch aggregators.</t>

        <figure align="center"
                title="The Mismatch between Device Memory and Communication Volume">
          <artwork align="center" type="ascii-art">+----------+  +----------+           +----------+
|          |  |          |           |          |
| worker 1 |  | worker 2 |           | worker n |
|          |  |          |  ... ...  |          |
+----+-----+  +----+-----+           +-----+----+
     |             |                       |
     |             |                       |
     +------+------+-----------------------+
            |GB/TB-level gradients
+-----------+-----------+           +-----------+
|           |           |           |           |
|  +--------+--------+  |           |           |
|  |     Switch      |  |           |  Control  |
|  |   Aggregators   |  |  Manage   |     &amp;     |
|  +-----------------&lt;--+-----------+ Management|
|  |Memory for others|  | Schedule  |           |
|  +-----------------+  |           |           |
| Switch Memory = 10MB  |           |           |
+-----------------------+           +-----------+</artwork>
        </figure>
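        <t>The preemptive, priority-based scheduling described in the use
        case above can be sketched as a small control-plane model. The
        sketch below is illustrative only, under assumed names
        (AggregatorScheduler, submit); it is not the API of any real
        switch.</t>

        <figure>
          <artwork type="code"><![CDATA[
```python
import heapq

class AggregatorScheduler:
    """Illustrative control-plane model (hypothetical API): allocate
    scarce switch aggregator slots by task priority, preempting
    lower-priority aggregation tasks when memory runs out."""

    def __init__(self, total_slots):
        self.free = total_slots
        self.running = []  # min-heap of (priority, task_id, slots)

    def submit(self, task_id, priority, slots):
        # Preempt strictly lower-priority tasks until enough slots free up.
        while self.free < slots and self.running and self.running[0][0] < priority:
            _, victim, freed = heapq.heappop(self.running)
            self.free += freed  # victim falls back to host-side aggregation
        if self.free < slots:
            return False        # cannot offload; run this collective on hosts
        self.free -= slots
        heapq.heappush(self.running, (priority, task_id, slots))
        return True
```
]]></artwork>
        </figure>

        <t>For example, with 10 slots, a priority-5 task needing 8 slots
        preempts an earlier priority-1 task, while a later priority-2 task
        is refused and stays on the end hosts.</t>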
      </section>

      <section title="Topology Management">
        <t>Topology awareness and mapping are REQUIRED in order to move some
        of the end-host computation onto network nodes for collective
        communication optimization. In many collective operation tasks, the
        logical relationships between nodes are described as a graph, which
        is then mapped onto the physical network. Collective communication
        offloading therefore requires awareness of the network topology and
        efficient mappings onto it.</t>

        <t>Use Case. In the parameter server architecture commonly used in
        distributed training, the parameter server can be mapped onto the
        spine switches of a fat-tree physical network once the network
        topology is known. Under this mapping, traffic paths are simplified
        and the traffic volume across the whole network is greatly reduced.
        Compared with the traditional collective communication mode, the
        optimized end-to-network or end-to-network-to-end mode with topology
        awareness and mapping brings the physical and logical topologies
        closer together and makes them more unified.</t>

        <figure align="center"
                title="Topology Management and Topology Mapping">
          <artwork type="ascii-art">                 Logical Topology
                +----------------+
                |Parameter Server|
                +--------+-------+
                         |
          +----------+---+------------+
          |          |                |
     +----+---+ +----+---+        +---+----+
     |Worker 1| |Worker 2| ...... |Worker n|
     +--------+ +--------+        +--------+
                         |Mapping
                         |
              +----------+---------+
              |Management &amp; Control|
              |                    |
              | Topology Awareness |
              | Paths Planning     |
              +----------+---------+
                         |
                         |Mapping
                         v
                 Physical Topology
             +-----+         +-----+
             |Spine|         |Spine|
             +--+--+         +--+--+
                |               |
     +----------+-+------------++-----------+
     |            |            |            |
  +--+--+      +--+--+      +--+--+      +--+--+
  |Leaf |      |Leaf |      |Leaf |      |Leaf |
  +--+--+      +--+--+      +--+--+      +--+--+
     |            |            |            |
  +--+--+      +--+--+      +--+--+      +--+--+
  |     |      |     |      |     |      |     |
+-+-+ +-+-+  +-+-+ +-+-+  +-+-+ +-+-+  +-+-+ +-+-+
|GPU| |GPU|  |GPU| |GPU|  |GPU| |GPU|  |GPU| |GPU|
+---+ +---+  +---+ +---+  +---+ +---+  +---+ +---+</artwork>
        </figure>
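        <t>The mapping above can be sketched as building a two-level
        reduction tree from the physical attachment of workers. The function
        and names below (build_aggregation_tree, worker_to_leaf) are
        hypothetical, for illustration only.</t>

        <figure>
          <artwork type="code"><![CDATA[
```python
from collections import defaultdict

def build_aggregation_tree(worker_to_leaf, spine):
    """Illustrative topology-aware mapping (hypothetical names):
    group workers under their leaf switch so partial aggregation
    happens at the first hop, then forward one partial result per
    leaf to the spine hosting the parameter-server role."""
    tree = defaultdict(list)
    for worker, leaf in worker_to_leaf.items():
        tree[leaf].append(worker)  # first-hop aggregation at the leaf
    leaves = sorted(tree)
    tree[spine] = leaves           # leaves feed the spine aggregator
    return dict(tree)
```
]]></artwork>
        </figure>

        <t>Under such a mapping, the spine-leaf links carry one flow per
        leaf instead of one flow per worker, which is where the traffic
        compression comes from.</t>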
      </section>

      <section title="Interface Management">
        <t>Collective communication interfaces MUST be defined and managed
        to shield application developers from tedious network engineering
        details such as flow control, packet organization, and chip-specific
        programming languages. Otherwise, application developers would need
        a great deal of arcane knowledge and expertise, which is beyond what
        they are willing to acquire and hinders the evolution of emerging
        applications.</t>

        <t>Use Case. Industry and academia have already proposed
        abstractions of collective communication operations, such as the
        collective communication libraries MPI, NCCL, and NetRPC <xref
        target="NetRPC"/>. In the control plane, these interfaces need to be
        configured and instantiated to realize their part of the collective
        communication functionality.</t>
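        <t>The shape of such an interface can be sketched as follows. The
        names (Collective, HostBackend) are hypothetical and do not
        reproduce the MPI, NCCL, or NetRPC APIs; the point is that the
        choice between a host library and an in-network offload stays behind
        an interface that the control plane configures.</t>

        <figure>
          <artwork type="code"><![CDATA[
```python
class HostBackend:
    """Trivial single-process stand-in backend, for illustration only."""

    def allreduce(self, values):
        # Every participant ends up with the sum of all contributions.
        return [sum(values)] * len(values)

class Collective:
    """Hypothetical application-facing interface: flow control, packet
    organization, and chip-specific programming stay behind it."""

    def __init__(self, backend):
        # The control plane would pick and configure the backend, e.g.
        # a host library or an in-network aggregation offload.
        self.backend = backend

    def allreduce(self, values):
        return self.backend.allreduce(values)
```
]]></artwork>
        </figure>

        <t>An application calls only the high-level operation, e.g.
        Collective(HostBackend()).allreduce(gradients), and the same call
        works unchanged if the control plane later swaps in an offloaded
        backend.</t>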
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References">
      <reference anchor="ESA">
        <front>
          <title>Efficient Data-Plane Memory Scheduling for In-Network
          Aggregation</title>

          <author fullname="Hao Wang" surname="Wang">
            <organization>iSING Lab, Hong Kong University of Science and
            Technology</organization>
          </author>

          <date year="2022"/>
        </front>
      </reference>

      <reference anchor="NetRPC">
        <front>
          <title>NetRPC: Enabling In-Network Computation in Remote Procedure
          Calls</title>

          <author fullname="Bohan Zhao" surname="Zhao">
            <organization>Tsinghua University</organization>
          </author>

          <date year="2023"/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
