<?xml version="1.0" encoding="US-ASCII"?>
<!-- This is built from a template for a generic Internet Draft. Suggestions for
     improvement welcome - write to Brian Carpenter, brian.e.carpenter @ gmail.com 
     This can be converted using the Web service at http://xml.resource.org/ -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<!-- You want a table of contents -->
<!-- Use symbolic labels for references -->
<!-- This sorts the references -->
<!-- Change to "yes" if someone has disclosed IPR for the draft -->
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate -->
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info"
     docName="draft-yao-tsvwg-cco-problem-statement-and-usecases-00"
     ipr="trust200902">
  <front>
    <title
    abbrev="CCO: Problem Statement and Use Cases">Collective
    Communication Optimization: Problem Statement and Use Cases</title>

    <author fullname="Kehan Yao" initials="K." surname="Yao">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>yaokehan@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Shiping Xu" initials="S." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>xushiping@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Yizhou Li" initials="Y." surname="Li">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city>Nanjing, Jiangsu</city>

          <country>China</country>
        </postal>

        <email>liyizhou@huawei.com</email>
      </address>
    </author>

    <author fullname="Hongyi Huang" initials="H." surname="Huang">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <country>China</country>
        </postal>

        <email>hongyi.huang@huawei.com</email>
      </address>
    </author>

    <author fullname="Dirk KUTSCHER" initials="D." surname="KUTSCHER">
      <organization>HKUST (Guangzhou)</organization>

      <address>
        <postal>
          <street/>

          <city>Guangzhou</city>

          <country>China</country>
        </postal>

        <email>dku@hkust-gz.edu.cn</email>
      </address>
    </author>

    <date day="23" month="October" year="2023"/>

    <area>Transport</area>

    <workgroup>Transport Area Working Group</workgroup>

    <keyword>parallel computing</keyword>

    <keyword>collective operation</keyword>

    <keyword>in-network computing</keyword>

    <abstract>
      <t>Collective communication is the basic logical communication model
      for distributed applications. As distributed systems scale, the
      communication overhead becomes the bottleneck of the entire system and
      prevents overall performance from increasing further. This document
      describes the performance challenges that arise when collective
      communication is employed with more nodes or processes participating,
      or with a larger number of communication rounds required to complete a
      single job. It also presents several use cases in which different
      aspects of collective communication optimization are needed.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>Collective communication is the basic logical communication model
      for distributed applications such as AI model training and inference,
      high-performance computing, big data analysis, and distributed
      storage. It defines several inter-process communication patterns upon
      which modern high-performance distributed systems can be built.
      Several open standards and programming models already support
      collective communication, such as the Message Passing Interface (MPI)
      and Partitioned Global Address Space (PGAS) languages. However, these
      programming models focus only on the application level and do not
      differentiate among the capabilities of the underlying network. In
      most cases, existing implementations of collective communication
      employ point-to-point (P2P) transmission in the network. Because
      collective communication by nature involves one or more senders and
      one or more receivers, naive point-to-point transmission inevitably
      leads to significant communication overhead inside these distributed
      systems. A more optimized and performant communication mechanism,
      designed specifically for collective communication, is needed.</t>

      <t>One common implementation mechanism for collective communication is
      Remote Direct Memory Access (RDMA). InfiniBand (IB) networks natively
      support RDMA and are the state-of-the-art networking solution for
      collective communication. To improve collective communication
      performance, IB provides enhancements such as SHArP (Scalable
      Hierarchical Aggregation Protocol) <xref target="SHArP"/> and hardware
      multicast <xref target="Hardware_Multicast"/>, which support
      collective operation offloading, saving bandwidth and reducing
      communication latency.</t>

      <t>However, Ethernet-based RDMA does not support such capabilities.
      Since Ethernet is the most widely used link layer technology,
      Ethernet-based RDMA should evolve to support collective communication
      offloading. Optimizations in transport protocols and
      application-network co-design are also needed to make Ethernet-based
      RDMA more suitable for collective communication. This work should be
      considered in the IETF.</t>

      <t>This document first presents four different use cases to illustrate
      where collective communication is used and how optimized collective
      communication can improve distributed application performance. It then
      points out that the fundamental reason existing protocols cannot meet
      the high-performance requirements of collective communication is that
      these distributed applications are not co-designed with the underlying
      networking protocols. There is a semantic gap between inter-process
      message transport and packet forwarding, which should be bridged by
      efficient mapping and optimization. The document further analyzes the
      problem from three perspectives. First, current end-to-end transport
      protocols do not support efficient offloading of collective
      operations. Second, the control and management plane does not support
      specific extensions required by collective operations, such as
      topology awareness for in-network offloading. Third, current
      implementations of collective communication cannot make use of
      existing IP multicast protocols, which could reduce communication
      overhead and save bandwidth.</t>
    </section>

    <section title="Use Cases">
      <section title="Distributed Training">
        <t>Large Language Models (LLMs) such as ChatGPT have had a
        phenomenal impact on industry, driving the digital and intelligent
        transformation of society. Foundation models can have trillions of
        parameters and therefore need to be deployed in a distributed
        manner for both training and inference. In Transformer-based
        foundation models, a commonly used training architecture is Mixture
        of Experts (MoE). MoE relies on two basic collective operations,
        AlltoAll and Allreduce.</t>

        <t>In the AlltoAll phase, the Gate deployed on each device passes
        intermediate results to Feed Forward Networks (FFNs) deployed on
        other devices, and the FFNs then pass their computation results to
        the "Add &amp; Normalization" block for the next computation step.
        AlltoAll transmission follows a group-to-group communication
        pattern, which incurs considerable bandwidth contention and can be
        optimized.</t>

        <t>In the Allreduce phase, the gradients generated on each device
        need to be submitted to a central node for aggregation. An incast
        problem may arise if the number of distributed nodes is large and
        the gradient messages consume too much bandwidth. Offloading the
        Allreduce operation to network devices can save a large amount of
        transmission bandwidth; it can also roughly halve the transmission
        distance and thereby reduce latency.</t>

        <figure align="center" title="Collective communication in MoE ">
          <artwork align="center">        DeVice0               ...               DeVice1
+--------------------+                  +--------------------+
|                    |      DeVice0     |                    |
|  +---------------  |        ...       |  +---------------  |
|  |self attention|  |      DeViceN     |  |self attention|  |
|  +--------------+  |                  |  +--------------+  |
|  +--------------+  |                  |  +--------------+  |
|  |  Add &amp; Norm  |  |                  |  |  Add &amp; Norm  |  |
|  +--------------+  |                  |  +--------------+  |
|  +--------------+  |                  |  +--------------+  |
|  |   +------+   |  |                  |  |   +------+   |  |
|  |   | Gate | &lt;------+                +-----&gt;+ Gate |   |  |
|  |   +------+   |  | |   +--------+   |  |   +------+   |  |
|  |   +------+   |  | +--&gt;+AlltoAll+&lt;--+  |   +------+   |  |
|  |   | FFN0 | &lt;------+   +--------+   +-----&gt;+ FFN0 |   |  |
|  |   +------+   |  | |                |  |   +------+   |  |
|  +--------------+  | |   +--------+   |  +--------------+  |
|  +--------------+  | +--&gt;+AlltoAll+&lt;--+  +--------------+  |
|  |  Add &amp; Norm  +&lt;---+   +--------+   +-&gt;+  Add &amp; Norm  |  |
|  +--------------+  |                  |  +--------------+  |
|         |          |                  |         |          |
+---------|----------+                  +---------|----------+
          |                                       |
          |                +---------+            |
          +--------------&gt; |Allreduce| &lt;----------+
                           +---------+
                          
                           Distributed</artwork>
        </figure>
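
        <t>As a purely illustrative sketch (not taken from any cited
        specification), the two collective phases described above are
        typically expressed at the application level with MPI calls similar
        to the following; the chunk size, the use of MPI_FLOAT, and the SUM
        operator are assumptions made for this example.</t>

        <figure align="center"
                title="Illustrative MPI sketch of the AlltoAll and Allreduce phases (example only)">
          <artwork align="left">#include &lt;mpi.h&gt;
#include &lt;stdlib.h&gt;

#define CHUNK 1024    /* floats exchanged with each peer (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&amp;argc, &amp;argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &amp;rank);
    MPI_Comm_size(MPI_COMM_WORLD, &amp;size);

    float *tok_out = calloc((size_t)size * CHUNK, sizeof(float));
    float *tok_in  = calloc((size_t)size * CHUNK, sizeof(float));
    float *grad    = calloc(CHUNK, sizeof(float));
    float *grad_s  = calloc(CHUNK, sizeof(float));

    /* AlltoAll phase: each device sends a distinct chunk to every
       other device (Gate to FFNs); today this is usually realized
       as a mesh of point-to-point transfers. */
    MPI_Alltoall(tok_out, CHUNK, MPI_FLOAT,
                 tok_in,  CHUNK, MPI_FLOAT, MPI_COMM_WORLD);

    /* Allreduce phase: gradients from all devices are summed and
       the result is returned to every device; this is the step
       that in-network aggregation could offload. */
    MPI_Allreduce(grad, grad_s, CHUNK, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    free(tok_out); free(tok_in); free(grad); free(grad_s);
    MPI_Finalize();
    return 0;
}</artwork>
        </figure>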
      </section>

      <section title="High-Performance Computing">
        <t>The basis of HPC is parallel computing. In modern HPC clusters,
        parallel computing is usually implemented by having many CPU cores
        work on the same task simultaneously, finishing it concurrently and
        thus achieving higher performance. During parallel computing,
        messages are passed between processes running on different CPU
        cores, and this is where several collective operations are needed.
        In main-worker mode, for example, Allreduce happens when the main
        node gathers messages from several workers and computes the
        aggregated result.</t>

        <figure align="center" title="Main-worker Allreduce in HPC clusters">
          <artwork align="center">                   +-------+
         +-------&gt; |worker0|
         |         +-------+
         |
         |
         | Allreduce
         |
         |
+----+   |         +-------+
|Main| &lt;---------&gt; |worker1|
+----+   |         +-------+
         |
         |
         | Allreduce
         |
         |         +-------+
         +-------&gt; |worker2|
                   +-------+</artwork>
        </figure>
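
        <t>A minimal MPI sketch of this main-worker pattern, assuming the
        main node is rank 0 and each process contributes MPI_DOUBLE values
        (both assumptions made for illustration only), could look as
        follows.</t>

        <figure align="center"
                title="Illustrative MPI sketch of main-worker aggregation (example only)">
          <artwork align="left">#include &lt;mpi.h&gt;

#define N 512    /* elements contributed by each process (assumed) */

int main(int argc, char **argv)
{
    int rank;
    double local[N] = {0}, aggregated[N];

    MPI_Init(&amp;argc, &amp;argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &amp;rank);

    /* Every process contributes its local data; only the main node
       (root = 0) receives the element-wise sum over all workers. */
    MPI_Reduce(local, aggregated, N, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}</artwork>
        </figure>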
      </section>

      <section title="Distributed Storage Systems">
        <t>Collective operations such as broadcast are also used in
        distributed storage systems. Primary servers perform data operations
        such as replication and modification and broadcast the resulting
        messages to backup servers. Currently, the broadcast operation is
        implemented with commodity RDMA Network Interface Cards (RNICs)
        using unicast operations, and the performance of RDMA-based data
        replication suffers from two drawbacks of unicast traffic: data
        redundancy and independent replication state. On the one hand, the
        data redundancy across multiple replications wastes a large amount
        of bandwidth on the network links towards the replica servers. On
        the other hand, the independent replication state duplicates CPU
        work: the primary must independently post a request, poll the
        completion, and keep track of the delivery status of each unicast
        replication, which incurs considerable overhead during
        transmission.</t>

        <figure align="center"
                title="Broadcast in distributed storage systems">
          <artwork align="center">              +-------+
              |Primary|
              +---+---+
                  |
                  |    Replica
                  |   broadcast
                  v
               +--+---+
   +-----------+Switch+----------+
   |           +--+---+          |
   |              |              |
   |              |              |
   |              |              |
   v              v              v
+--+---+       +--+---+      +---+--+
|Backup|       |Backup|      |Backup|
+------+       +------+      +------+</artwork>
        </figure>
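
        <t>To make the redundancy concrete, the sketch below (illustrative
        only, and using MPI calls rather than RDMA verbs) contrasts the
        per-backup unicast replication described above with a single
        broadcast; the payload size and the rank numbering are assumptions
        made for this example.</t>

        <figure align="center"
                title="Unicast replication versus broadcast (illustrative sketch)">
          <artwork align="left">#include &lt;mpi.h&gt;
#include &lt;string.h&gt;

#define PAYLOAD 4096    /* replicated block size in bytes (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    char block[PAYLOAD];

    MPI_Init(&amp;argc, &amp;argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &amp;rank);
    MPI_Comm_size(MPI_COMM_WORLD, &amp;size);
    if (rank == 0)
        memset(block, 0x5a, PAYLOAD);    /* primary fills the block */

    /* Unicast replication (current practice): the primary (rank 0)
       posts one send per backup and tracks each delivery, so the
       same payload crosses the primary's link size-1 times. */
    if (rank == 0) {
        for (int i = 1; i &lt; size; i++)
            MPI_Send(block, PAYLOAD, MPI_BYTE, i, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(block, PAYLOAD, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    /* Broadcast replication (desired): a single collective call lets
       the library or the network fan the payload out, removing the
       per-replica redundancy on the primary's link. */
    MPI_Bcast(block, PAYLOAD, MPI_BYTE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}</artwork>
        </figure>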
      </section>

      <section title="Big Data Analysis(MapReduce)">
        <t>Mainstream distributed big data analysis systems such as Spark
        also rely on collective operations. The main communication
        performance bottleneck occurs in the Shuffle phase. The Shuffle
        phase involves data movement across multiple workers; it was
        originally implemented over TCP/IP sockets for distributed
        communication among processes, which yields low performance. The
        Shuffle phase can be accelerated via MPI Java bindings, replacing
        TCP with an MPI-based transport for high-performance data
        movement.</t>

        <figure align="center"
                title="Collective operations in Shuffle phase of big data systems">
          <artwork align="center">                  before
                 shuffle
                   +--+ +--+
                   |c1| |c1|
         +-------+ |c1| |a1| after shuffle
         |worker2| |c1| |b1|
         +-------+ +--+ +--+

             ^                     +--+ +--+
             |                     |a1| |a1|
             |           +-------+ |a1| |c1|
+------+     +---------&gt; |worker1| |a1| |b1|
|Master|                 +-------+ +--+ +--+
+------+
                             ^
+------+                     |
|Driver|                     |
+------+                     |
             +---------------^
             |
             |
             |
             v     +--+ +--+
                   |b1| |b1|
         +-------+ |b1| |a1|
         |worker3| |b1| |c1|
         +-------+ +--+ +--+</artwork>
        </figure>
      </section>
    </section>

    <section title="Problem Statement">
      <t>The demand for computing resources in AI/HPC applications is
      growing rapidly, and single-node computing power cannot meet it.
      Parallel computing has therefore become the norm. Parallel computing
      is a type of computing architecture in which several processors
      simultaneously execute multiple smaller calculations broken down from
      an overall larger, complex problem. Collective communication plays a
      role in data aggregation, data distribution, and synchronization in
      distributed computing systems. The collective communication primitives
      include collective operations such as Reduce, All-Reduce, Bcast,
      Alltoall, Scatter, and Gather, as well as synchronization operations
      such as Barrier.</t>

      <figure title="Collective Communication of Parallel Computing">
        <artwork align="center">+---------------+--------------+---------------------------------+
|     Type      |  Function    |         Description             |
+---------------+--------------+---------------------------------+
|               |  Bcast       | One to group.                   |
|               |              | One process sends (broadcasts)  |
|               |              | some data to all the processes  |
|               |              | in a group.                     |
|               +--------------+---------------------------------+
|               |  Gather      | Group to one.                   |
|               |              | An array is scattered across    |
|               |              | all processes in the group, and |
|               |              | one process (root) collects each|
|               |              | piece of the array into a       |
|               |              | specified array.                |
|               +--------------+---------------------------------+
|      Data     |  Allgather   | All processes, not just the     |
|    Movement   |              | root, receive the result of     |
|               |              | Gather.                         |
|               +--------------+---------------------------------+
|               |  Scatter     | One-To-Group.                   |
|               |              | One process distributes the data|
|               |              | into n segments, where the i-th |
|               |              | segment is sent to the i-th     |
|               |              | process in the group which has  |
|               |              | n processes.                    |
|               +--------------+---------------------------------+
|               |  Alltoall    | This is an extension to         |
|               |              | Allgather. Each process sends   |
|               |              | distinct data to each receiver. |
|               |              | The j-th block from process i is|
|               |              | received by process j and stored|
|               |              | in the i-th block.              |
+---------------+--------------+---------------------------------+
|               |  Reduce      | Group to one.                   |
|               |              | Used to collect data or partial |
|               |              | results from multiple processing|
|               |              | units and to combine them into a|
|               |              | global result by a chosen       |
|               |              | operator.                       |
|               +--------------+---------------------------------+
|               |  All-Reduce  | Distribute the result of a      |
|     Data      |              | Reduce operation to all         |
|               |              | processes in the group.         |
|  Aggregation  +--------------+---------------------------------+
|               |Reduce-Scatter| Scatter the result of a Reduce  |
|               |              | operation across all processes. |
|               +--------------+---------------------------------+
|               |  Scan        | A Scan operation performs       |
|               |              | partial reductions on           |
|               |              | distributed data.               |
+---------------+--------------+---------------------------------+
|Synchronization|  Barrier     | A synchronous operation to      |
|               |              | synchronize all processes       |
|               |              | within a communicator.          |
+---------------+--------------+---------------------------------+</artwork>
      </figure>

      <section title="Collective Message Transport Issues">
        <section title="Reliability">
          <t>Traditional transport-layer protocols such as TCP and QUIC
          provide reliable transmission only for point-to-point
          communication, while multicast and broadcast communication
          receives only best-effort service. Therefore, in order to meet
          the requirement of reliable transmission, implementations of
          collective operations such as Bcast and AlltoAll are often built
          from point-to-point operations at the hosts.</t>

          <t>This wastes host CPU resources on repeated packetization and
          sending, while multiple identical data packets on the same path
          constitute redundant traffic and exacerbate the network load. To
          solve these problems, parallel computing networks need a reliable
          transmission mechanism suitable for collective operations, one
          that cooperates with appropriate traffic engineering and
          congestion control mechanisms.</t>

          <t>Collective communication optimization generally uses a network
          node such as a switch to perform a collective operation such as
          Reduce. The network node receives data from multiple senders and
          computes the result based on the provided operator, e.g. the SUM
          operator. There are two approaches to reliability for ensuring
          the correctness of the collective operations.</t>

          <t>The first approach is to keep using end-to-end reliability.
          The intermediate node that performs the Reduce is not considered
          an endpoint that participates in the traditional reliability
          guarantee. Referring to the figure below, Hosts 1 to 3 are
          workers for a Reduce operation, and Host 4 is the parameter
          server that collects the data from all the workers to perform a
          SUM computation. A transport session is established between each
          worker and the parameter server. The switch in the middle
          performs the in-network aggregation. In case of packet loss
          between any of the workers and the switch or between the switch
          and the parameter server, end-to-end reliability mechanisms such
          as retransmission are triggered by the endpoints. The
          intermediate switch is thus relieved of the burden of maintaining
          transport sessions and most of the associated state. Consider
          scenarios where the data packets are small, as in some high
          performance computing workloads: the switch simply keeps a single
          packet from every worker until it can compute the sum over all
          workers, so the state maintained is minimal. How reliability can
          be efficiently ensured, in terms of fast loss detection and
          signaling, when an intermediate node plays such a role should be
          considered. In addition to the reliability considerations, how to
          ensure that the data is aggregated at the correct intermediate
          node or nodes should also be considered. In encryption scenarios,
          ubiquitous connection encryption (e.g., QUIC) makes it hard to
          employ such performance-enhancing functions.</t>

          <t>The second approach is to treat the intermediate switch as an
          endpoint. Reliability is then guaranteed between worker and
          switch and between switch and parameter server as two independent
          sessions. This requires the switch to support the full transport
          function, including session establishment and all state
          maintenance. A lighter form of transport session maintenance is
          expected to bring more benefit. In addition, there should be
          fallback signaling to tolerate faults. In encryption scenarios,
          if network devices are required to encrypt and decrypt data, they
          will need substantial resources, have to maintain a large number
          of sessions, and face issues such as managing secret keys.</t>

          <figure align="center"
                  title="In-Network Aggregation Packet Termination">
            <artwork align="center">         +-----+
         |Pkt_1|
         +-----+
+------+
|Host_1+---------+
+------+         |
                 |
         +-----+ |                              +-------+
         |Pkt_2| |  +-------------------------+ |Pkt_Agg|
         +-----+ |  |                         | +-------+
+------+         |  |         Switch          |            +------+
|Host_2+---------+--&gt; (In-Network Aggregation)+-----------&gt;|Host_4|
+------+         |  |                         |            +------+
                 |  +-------------------------+
         +-----+ |
         |Pkt_3| |
         +-----+ |
+------+         |
|Host_3+---------+
+------+</artwork>
          </figure>
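
          <t>The following sketch is purely illustrative of the minimal
          per-step state described above for the small-packet case; the
          worker count, payload size, and function names are assumptions,
          not a proposed switch design.</t>

          <figure align="center"
                  title="Illustrative per-step aggregation state for small packets (example only)">
            <artwork align="left">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define NUM_WORKERS 3    /* Host_1 to Host_3 in the figure (assumed) */
#define VALUES      256  /* payload elements per packet (assumed)    */

/* Minimal aggregation state for one in-flight Reduce step: a bitmap
   of workers heard from plus a running sum, matching the "single
   packet from every worker" case described above. */
struct agg_state {
    uint32_t seen;             /* bitmap of workers heard from */
    int32_t  sum[VALUES];      /* running SUM over payloads    */
};

/* Called for each incoming worker packet; returns 1 when the
   aggregate is complete and Pkt_Agg can be forwarded to Host_4. */
static int agg_receive(struct agg_state *st, int worker_id,
                       const int32_t payload[VALUES])
{
    if (st-&gt;seen &amp; (1u &lt;&lt; worker_id))
        return 0;              /* duplicate, e.g. a retransmission */
    for (int i = 0; i &lt; VALUES; i++)
        st-&gt;sum[i] += payload[i];
    st-&gt;seen |= 1u &lt;&lt; worker_id;
    return st-&gt;seen == (1u &lt;&lt; NUM_WORKERS) - 1u;
}

int main(void)
{
    struct agg_state st;
    int32_t pkt[VALUES];

    memset(&amp;st, 0, sizeof(st));
    for (int w = 0; w &lt; NUM_WORKERS; w++) {
        memset(pkt, 0, sizeof(pkt));
        pkt[0] = w + 1;        /* worker w contributes the value w+1 */
        if (agg_receive(&amp;st, w, pkt)) {
            /* st.sum now holds the aggregate; forward as Pkt_Agg. */
        }
    }
    return 0;
}</artwork>
          </figure>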
        </section>

        <section title="Semantic Gap Between Message and Packet">
          <t>Collective operation offloading utilizes network devices to
          perform computing tasks that require only limited computational
          precision but high I/O throughput, thereby optimizing collective
          operations. Collective communication optimization devices not
          only perform packet routing and forwarding, but also need to
          process transport-layer messages. Therefore, the transport layer
          needs to provide a mapping mechanism between packets and messages
          and to handle transport-layer message transmission. Existing
          transport layers do not provide this function.</t>

          <t>Message size impacts the performance of collective operation
          implementations. Packet size usually has an upper bound, for
          example an Ethernet jumbo frame is about 9.6 KB, while message
          size has no such limitation and is determined by the application.
          For network devices, processing large messages incurs a lot of
          overhead: deep buffers are required and the Serialization and
          Deserialization (SerDes) logic comes under pressure. Large
          messages also limit the message sending rate at the host and thus
          lower end-to-end system performance. How to choose an appropriate
          message size and optimize its processing remains an open
          problem.</t>

          <figure align="center"
                  title="In-Network computing devices need to process transport layer messages">
            <artwork align="center" type="ascii-art">       Host         Traditional        In-Network         Host
                  Network Devices   Network Devices
 +-------------+                                    +--------------+
 |             |                                    |              |
 | +---------+ |                                    | +----------+ |
 | |   App   | |                  +--------------+  | |   App    | |
 | +---------+ |                  |              |  | +----------+ |
 |             |                  |              |  |              |
++-------------+------------------+--------------+--+--------------++
||             |             Messages            |  |              ||
|| +---------+ |                  | +----------+ |  | +----------+ ||
|| |Transport| |                  | | Message  | |  | | Transport| ||
|| |  Layer  | |                  | |Processing| |  | |  Layer   | ||
|| +---------+ |                  | +----------+ |  | +----------+ ||
||             |                  |              |  |              ||
++-------------+------------------+--------------+--+--------------++
 |             |                  |              |  |              |
 |             | +--------------+ |              |  |              |
++-------------+-+--------------+-+--------------+--+--------------++
||             | |              | |              |  |              ||
|| +---------+ | |              | |              |  | +----------+ ||
|| | Network | | |            Packets            |  | |  Network | ||
|| |  Layer  | | |              | |              |  | |   Layer  | ||
|| |  (IB,   | | |              | |              |  | |   (IB,   | ||
|| | TCP/IP) | | | +----------+ | | +----------+ |  | |  TCP/IP) | ||
|| +---------+ | | |          | | | |          | |  | +----------+ ||
||             | | |          | | | |          | |  |              ||
|| +---------+ | | |  Packet  | | | |  Packet  | |  | +----------+ ||
|| |  Link   | | | |Forwarding| | | |Forwarding| |  | |   Link   | ||
|| |  Layer  | | | |          | | | |          | |  | |   Layer  | ||
|| |(IB Link,| | | |          | | | |          | |  | | (IB Link,| ||
|| |   Eth)  | | | +----------+ | | +----------+ |  | |    Eth)  | ||
|| +---------+ | |              | |              |  | +----------+ ||
||             | |              | |              |  |              ||
++-------------+-+--------------+-+--------------+--+--------------++
 +---------+---+ +-----+----+---+ +----+----+----+  +-----+--------+
           |           ^    |          ^    |             ^
           +-----------+    +----------+    +-------------+</artwork>
          </figure>
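
          <t>As an illustration of the message-to-packet mapping discussed
          above, the following sketch shows one hypothetical way a host
          could segment a message into bounded-size packets that an
          in-network device can associate back to the message; all field
          names and sizes are assumptions, not a proposed wire format.</t>

          <figure align="center"
                  title="Hypothetical message-to-packet mapping (illustrative sketch)">
            <artwork align="left">#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

#define MAX_PKT_PAYLOAD 8192  /* per-packet payload bound (assumed) */

/* Hypothetical per-packet header that lets an in-network device map
   packets back onto application-level messages. */
struct cco_pkt_hdr {
    uint32_t msg_id;       /* identifies the collective message      */
    uint32_t pkt_seq;      /* position of this packet in the message */
    uint32_t pkt_total;    /* number of packets in the message       */
    uint16_t payload_len;  /* bytes of message data in this packet   */
};

/* Number of packets needed to carry a message of msg_len bytes. */
static uint32_t packets_for_message(uint64_t msg_len)
{
    return (uint32_t)((msg_len + MAX_PKT_PAYLOAD - 1)
                      / MAX_PKT_PAYLOAD);
}

int main(void)
{
    /* A 1 MiB application message maps to 128 packets here; larger
       messages mean more per-message state in the network device. */
    printf("packets: %u\n",
           (unsigned)packets_for_message(1024 * 1024));
    return 0;
}</artwork>
          </figure>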
        </section>

        <section title="Blocking and Non-blocking Communications">
          <t>Parallel computing communication operators can be divided into
          two categories based on whether the caller waits for the
          communication result before proceeding with the next computation
          step: blocking and non-blocking. Blocking collective operations
          block the calling process until communication has completed.
          Non-blocking operations allow communication to be handled in the
          background, so that communication and computation can overlap.
          Parallel computing networks need to support both types of
          communication simultaneously, and for blocking communication it
          is even more important to use techniques such as in-network
          computing to optimize collective communication performance.</t>
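
          <t>The difference can be illustrated with the following MPI
          sketch (illustrative only): the blocking call returns only after
          the collective has completed, while the non-blocking variant lets
          independent local computation overlap with communication. The
          buffer sizes and the compute step are assumptions made for this
          example.</t>

          <figure align="center"
                  title="Blocking versus non-blocking Allreduce (illustrative sketch)">
            <artwork align="left">#include &lt;mpi.h&gt;

#define N 1024   /* elements reduced per call (assumed) */

/* Stand-in for useful computation that is independent of the
   ongoing Allreduce. */
static void local_compute(double *x, int n)
{
    for (int i = 0; i &lt; n; i++)
        x[i] = x[i] * 0.5 + 1.0;
}

int main(int argc, char **argv)
{
    double a[N] = {0}, r1[N], r2[N], other[N] = {0};
    MPI_Request req;

    MPI_Init(&amp;argc, &amp;argv);

    /* Blocking: the calling process is blocked until the reduced
       result is available in r1. */
    MPI_Allreduce(a, r1, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Non-blocking: communication proceeds in the background while
       independent computation overlaps with it. */
    MPI_Iallreduce(a, r2, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD,
                   &amp;req);
    local_compute(other, N);
    MPI_Wait(&amp;req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}</artwork>
          </figure>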
        </section>
      </section>

      <section title="Control and Management Plane Issues">
        <section title="In-Network Computing Primitives">
          <t>Once collective operation offloading is supported, the
          collective communication library needs to be expanded to support
          not only the end-to-end mode but also end-to-network
          communication modes, which means supporting in-network
          processing/computing functions.</t>

          <t>The implementation of the offloaded collective operations
          relies on the capabilities provided by physical network devices,
          known as in-network computing primitives. Currently, most
          in-network computing deployments use programmable network devices
          based on languages such as P4 (Programming Protocol-independent
          Packet Processors) and NPL (Network Programming Language). These
          programmable switches are not powerful enough to program
          non-trivial application-layer behavior, and their platforms are
          too heterogeneous to manage. We need to standardize the
          in-network computing primitives that network devices can provide,
          including data types, operations, etc. Based on the standardized
          in-network computing primitives, parallel in-network computing
          communication operators can then be implemented.</t>
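
          <t>For illustration only, a standardized primitive might be
          described by a small capability record such as the sketch below;
          the enumerations and field names are hypothetical and merely
          indicate the kind of information (data types, operations, limits)
          that would need to be agreed upon.</t>

          <figure align="center"
                  title="Hypothetical in-network computing primitive descriptor">
            <artwork align="left">#include &lt;stdint.h&gt;

/* Hypothetical enumerations of the data types and operations an
   in-network computing device could advertise. */
enum cco_dtype { CCO_INT32, CCO_FLOAT16, CCO_FLOAT32 };
enum cco_op    { CCO_SUM, CCO_MIN, CCO_MAX, CCO_BCAST };

/* Hypothetical capability record a device could expose to the
   control and management plane. */
struct cco_primitive {
    enum cco_op    op;           /* offloaded collective operation */
    enum cco_dtype dtype;        /* element type supported         */
    uint32_t       max_elements; /* per-message element limit      */
    uint32_t       max_groups;   /* concurrent collective groups   */
};

int main(void)
{
    struct cco_primitive p = { CCO_SUM, CCO_FLOAT32, 65536, 128 };
    (void)p;   /* would be advertised via the management plane */
    return 0;
}</artwork>
          </figure>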
        </section>

        <section title="Topology Awareness">
          <t>In data centers, fat-tree or Clos topologies are widely used.
          Some fully meshed network topologies exist as well.</t>

          <t>As the scale of parallel computing clusters continues to
          increase, it is difficult to achieve optimal transmission
          performance solely through the automatic convergence of
          traditional routing protocols. It is necessary to employ
          topology-aware algorithms to explore complex topologies and to
          complete path planning, traffic optimization, etc. However,
          existing topology-aware algorithms do not consider collective
          operation offloading scenarios and need to be modified to realize
          end-network cooperation, meeting the requirements of end-to-end
          path planning and traffic optimization.</t>
        </section>
      </section>

      <section title="One to Group Transmission Issues">
        <t>IP multicast has been designed to support broadcast-oriented
        applications such as live streaming and video conferencing. There
        are several standardized IP multicast protocols, such as PIM <xref
        target="RFC7761">RFC 7761</xref> and BIER <xref target="RFC8279">RFC
        8279</xref>. However, existing IP multicast protocols have not been
        fully exploited by distributed application systems that require
        collective communication. Most collective operations still use
        unicast transmission, which incurs a lot of overhead in the form of
        information redundancy, bandwidth occupancy, and host CPU
        consumption. What is needed is low-latency multi-destination
        delivery, potentially with reliability or at least failure
        detection. IP multicast is one potential solution (perhaps not an
        ideal one), but it is still worth exploring. Even though existing IP
        multicast protocols have their best-suited application scenarios and
        differ in state maintenance and multicast tree construction, they
        are still well worth applying to collective communication scenarios.
        Collective operations such as Bcast and AlltoAll can be augmented by
        extending IP multicast protocols, as can composite operations such
        as Reduce-Scatter and Allgather.</t>
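
        <t>As a minimal illustration of one-to-group delivery over IP
        multicast (best effort only, without the reliability discussed
        above), a receiver joins a multicast group and a sender transmits a
        single datagram that reaches all members; the group address and UDP
        port below are assumptions made for this example.</t>

        <figure align="center"
                title="One-to-group delivery over IP multicast (illustrative sketch)">
          <artwork align="left">#include &lt;arpa/inet.h&gt;
#include &lt;netinet/in.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;unistd.h&gt;

#define GROUP "239.1.1.1"   /* example multicast group (assumed) */
#define PORT  5000          /* example UDP port (assumed)        */

int main(void)
{
    /* Receiver side: bind and join the multicast group. */
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&amp;addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    bind(rx, (struct sockaddr *)&amp;addr, sizeof(addr));

    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr(GROUP);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(rx, IPPROTO_IP, IP_ADD_MEMBERSHIP, &amp;mreq,
               sizeof(mreq));

    /* Sender side: one datagram reaches every joined receiver,
       instead of one unicast send per destination. */
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in grp;
    memset(&amp;grp, 0, sizeof(grp));
    grp.sin_family = AF_INET;
    grp.sin_addr.s_addr = inet_addr(GROUP);
    grp.sin_port = htons(PORT);
    sendto(tx, "bcast payload", 13, 0,
           (struct sockaddr *)&amp;grp, sizeof(grp));

    close(rx);
    close(tx);
    return 0;
}</artwork>
        </figure>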
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>Collective communication optimization may introduce some security and
      privacy concerns, especially when collective operations offloading is
      needed.</t>

      <t>On one hand, the distributed nature of the computations and the
      involvement of network devices raise issues of data confidentiality,
      integrity, and authentication. There are potential vulnerabilities
      when data processed in the network is exposed to unauthorized access.
      It is suggested to support both security-enabled and security-less
      deployments, so that limited domains <xref target="RFC8799"/> do not
      have to pay the penalty of expensive cryptographic or authorization
      operations; within a limited domain, the application and the network
      can mutually trust each other, since they may both belong to the same
      administrator. Extending the technology to the Internet should be
      designed together with intrinsic protective measures.</t>

      <t>On the other hand, encrypted data brings challenges and performance
      issues for processing on network devices. Decrypting and encrypting
      data on network devices is not only inefficient, but also involves
      issues such as key management and authorization. Processing encrypted
      data may not be possible for all encryption algorithms and is not
      suitable for all scenarios.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.7761"?>

      <?rfc include="reference.RFC.8279"?>

      <?rfc include="reference.RFC.8799"?>
    </references>

    <references title="Informative References">
      <reference anchor="SHArP">
        <front>
          <title>Scalable Hierarchical Aggregation Protocol (SHArP): A
          Hardware Architecture for Efficient Data Reduction</title>

          <author fullname="Richard L. Graham" surname="Graham">
            <organization>Mellanox Technologies, Inc.</organization>
          </author>

          <date year="2023"/>
        </front>

        <seriesInfo name="DOI" value="10.1109/COMHPC.2016.006"/>
      </reference>

      <reference anchor="Hardware_Multicast">
        <front>
          <title>Fast and Scalable MPI-Level Broadcast using InfiniBand's
          Hardware Multicast Support</title>

          <author fullname="Jiuxing Liu" surname="Liu">
            <organization>Computer and Information Science The Ohio State
            University</organization>
          </author>

          <date year="2004"/>
        </front>

        <seriesInfo name="DOI" value="10.1109/ipdps.2004.1302912 "/>
      </reference>
    </references>
  </back>
</rfc>
