<?xml version="1.0" encoding="US-ASCII"?>
<!-- This is built from a template for a generic Internet Draft. Suggestions for
     improvement welcome - write to Brian Carpenter, brian.e.carpenter @ gmail.com 
     This can be converted using the Web service at http://xml.resource.org/ -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<!-- You want a table of contents -->
<!-- Use symbolic labels for references -->
<!-- This sorts the references -->
<!-- Change to "yes" if someone has disclosed IPR for the draft -->
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate -->
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info"
     docName="draft-yao-tsvwg-cco-problem-statement-and-usecases-00"
     ipr="trust200902">
  <front>
    <title
    abbrev="CCO: Problem Statement and Use Cases">Collective
    Communication Optimization: Problem Statement and Use Cases</title>

    <author fullname="Kehan Yao" initials="K." surname="Yao">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>yaokehan@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Shiping Xu" initials="S." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>xushiping@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Yizhou Li" initials="Y." surname="Li">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city>Nanjing, Jiangsu</city>

          <country>China</country>
        </postal>

        <email>liyizhou@huawei.com</email>
      </address>
    </author>

    <author fullname="Hongyi Huang" initials="H." surname="Huang">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <country>China</country>
        </postal>

        <email>hongyi.huang@huawei.com</email>
      </address>
    </author>

    <author fullname="Dirk KUTSCHER" initials="D." surname="KUTSCHER">
      <organization>HKUST (Guangzhou)</organization>

      <address>
        <postal>
          <street/>

          <city>Guangzhou</city>

          <country>China</country>
        </postal>

        <email>dku@hkust-gz.edu.cn</email>
      </address>
    </author>

    <date day="23" month="October" year="2023"/>

    <area>Transport</area>

    <workgroup>Transport Area Working Group</workgroup>

    <keyword>parallel computing</keyword>

    <keyword>collective operation</keyword>

    <keyword>in-network computing</keyword>

    <abstract>
      <t>Collective communication is the basic logical communication model
      for distributed applications. As distributed systems scale, the
      communication overhead becomes the bottleneck of the entire system and
      prevents overall performance from increasing further. This document
      describes the performance challenges that arise when collective
      communication is employed with more nodes or processes participating,
      or with a larger number of communication rounds required to complete a
      single job. It also presents several use cases in which different
      aspects of collective communication optimization are needed.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>Collective communication is the basic logical communication model
      for distributed applications such as AI model training and inference,
      high-performance computing, big data analysis, and distributed
      storage. It defines several inter-process communication patterns upon
      which modern high-performance distributed systems can be built.
      Several open standards and programming models already support
      collective communication, such as the Message Passing Interface (MPI)
      and Partitioned Global Address Space (PGAS) languages. However, these
      programming models focus only on the application level and do not
      differentiate among the capabilities of the underlying network. In
      most cases, existing implementations of collective communication
      employ point-to-point (P2P) transmission in the network. Because
      collective communication by nature involves one or more senders and
      one or more receivers, naive point-to-point transmission inevitably
      leads to significant communication overhead inside these distributed
      systems. A more optimized and performant communication mechanism,
      designed specifically for collective communication, is needed.</t>

      <t>One common implementation mechanism for collective communication is
      Remote Direct Memory Access (RDMA). InfiniBand (IB) networks natively
      support RDMA and are the state-of-the-art networking solution for
      collective communication. To improve collective communication
      performance, IB provides enhancements such as SHArP (Scalable
      Hierarchical Aggregation Protocol) <xref target="SHArP"/> and hardware
      multicast <xref target="Hardware_Multicast"/>, which support
      collective operation offloading, saving bandwidth and reducing
      communication latency.</t>

      <t>However, Ethernet-based RDMA does not support such capabilities.
      Since Ethernet is the most widely used link layer technology,
      Ethernet-based RDMA should evolve to support collective communication
      offloading. Optimizations in transport protocols and
      application-network co-design are also needed to make Ethernet-based
      RDMA more suitable for collective communication. This work should be
      considered in the IETF.</t>

      <t>This document first presents four different use cases to illustrate
      where collective communication is used and how optimized collective
      communication can improve distributed application performance. It then
      points out that the fundamental reason existing protocols cannot meet
      the high-performance requirements of collective communication is that
      these distributed applications are not co-designed with the underlying
      networking protocols. There is a semantic gap between inter-process
      message transport and packet forwarding, which should be bridged by
      efficient mapping and optimization. The document further analyzes the
      problem from three perspectives. First, current end-to-end transport
      protocols do not support efficient offloading of collective
      operations. Second, the control and management plane does not support
      specific extensions required by collective operations, such as
      topology awareness for in-network offloading. Third, current
      implementations of collective communication cannot make use of
      existing IP multicast protocols, which could reduce communication
      overhead and save bandwidth.</t>
    </section>

    <section title="Use Cases">
      <section title="Distributed Training">
        <t>Large Language Models (LLMs) such as ChatGPT have had a
        phenomenal impact on industry, driving the digital and intelligent
        transformation of society. Foundation models can have trillions of
        parameters and therefore need to be deployed in a distributed
        manner for both training and inference. In Transformer-based
        foundation models, a commonly used training architecture is Mixture
        of Experts (MoE). MoE relies on two basic collective operations,
        AlltoAll and Allreduce.</t>

        <t>In the AlltoAll phase, the Gate deployed on each device passes
        intermediate results to Feed Forward Networks (FFNs) deployed on
        other devices, and the FFNs then pass their computation results to
        the "Add &amp; Normalization" block for the next computation step.
        AlltoAll transmission follows a group-to-group communication
        pattern, which incurs considerable bandwidth contention and can be
        optimized.</t>

        <t>In the Allreduce phase, the gradients generated on each device
        need to be submitted to a central node for aggregation. An incast
        problem may arise if the number of distributed nodes is large and
        the gradient messages consume too much bandwidth. Offloading the
        Allreduce operation to network devices can save a large amount of
        transmission bandwidth; it can also roughly halve the transmission
        distance and thereby reduce latency.</t>

        <figure align="center" title="Collective communication in MoE ">
          <artwork align="center">        DeVice0               ...               DeVice1
+--------------------+                  +--------------------+
|                    |      DeVice0     |                    |
|  +---------------  |        ...       |  +---------------  |
|  |self attention|  |      DeViceN     |  |self attention|  |
|  +--------------+  |                  |  +--------------+  |
|  +--------------+  |                  |  +--------------+  |
|  |  Add &amp; Norm  |  |                  |  |  Add &amp; Norm  |  |
|  +--------------+  |                  |  +--------------+  |
|  +--------------+  |                  |  +--------------+  |
|  |   +------+   |  |                  |  |   +------+   |  |
|  |   | Gate | &lt;------+                +-----&gt;+ Gate |   |  |
|  |   +------+   |  | |   +--------+   |  |   +------+   |  |
|  |   +------+   |  | +--&gt;+AlltoAll+&lt;--+  |   +------+   |  |
|  |   | FFN0 | &lt;------+   +--------+   +-----&gt;+ FFN0 |   |  |
|  |   +------+   |  | |                |  |   +------+   |  |
|  +--------------+  | |   +--------+   |  +--------------+  |
|  +--------------+  | +--&gt;+AlltoAll+&lt;--+  +--------------+  |
|  |  Add &amp; Norm  +&lt;---+   +--------+   +-&gt;+  Add &amp; Norm  |  |
|  +--------------+  |                  |  +--------------+  |
|         |          |                  |         |          |
+---------|----------+                  +---------|----------+
          |                                       |
          |                +---------+            |
          +--------------&gt; |Allreduce| &lt;----------+
                           +---------+
                          
                           Distributed</artwork>
        </figure>
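
        <t>As a purely illustrative sketch (not taken from any cited
        specification), the two collective phases described above are
        typically expressed at the application level with MPI calls similar
        to the following; the chunk size, the use of MPI_FLOAT, and the SUM
        operator are assumptions made for this example.</t>

        <figure align="center"
                title="Illustrative MPI sketch of the AlltoAll and Allreduce phases (example only)">
          <artwork align="left">#include &lt;mpi.h&gt;
#include &lt;stdlib.h&gt;

#define CHUNK 1024    /* floats exchanged with each peer (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&amp;argc, &amp;argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &amp;rank);
    MPI_Comm_size(MPI_COMM_WORLD, &amp;size);

    float *tok_out = calloc((size_t)size * CHUNK, sizeof(float));
    float *tok_in  = calloc((size_t)size * CHUNK, sizeof(float));
    float *grad    = calloc(CHUNK, sizeof(float));
    float *grad_s  = calloc(CHUNK, sizeof(float));

    /* AlltoAll phase: each device sends a distinct chunk to every
       other device (Gate to FFNs); today this is usually realized
       as a mesh of point-to-point transfers. */
    MPI_Alltoall(tok_out, CHUNK, MPI_FLOAT,
                 tok_in,  CHUNK, MPI_FLOAT, MPI_COMM_WORLD);

    /* Allreduce phase: gradients from all devices are summed and
       the result is returned to every device; this is the step
       that in-network aggregation could offload. */
    MPI_Allreduce(grad, grad_s, CHUNK, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    free(tok_out); free(tok_in); free(grad); free(grad_s);
    MPI_Finalize();
    return 0;
}</artwork>
        </figure>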
      </section>

      <section title="High-Performance Computing">
        <t>The basis of HPC is parallel computing. In modern HPC clusters,
        parallel computing is usually implemented by having many CPU cores
        work on the same task simultaneously, finishing it concurrently and
        thus achieving higher performance. During parallel computing,
        messages are passed between processes running on different CPU
        cores, and this is where several collective operations are needed.
        In main-worker mode, for example, Allreduce happens when the main
        node gathers messages from several workers and computes the
        aggregated result.</t>

        <figure align="center" title="Main-worker Allreduce in HPC clusters">
          <artwork align="center">                   +-------+
         +-------&gt; |worker0|
         |         +-------+
         |
         |
         | Allreduce
         |
         |
+----+   |         +-------+
|Main| &lt;---------&gt; |worker1|
+----+   |         +-------+
         |
         |
         | Allreduce
         |
         |         +-------+
         +-------&gt; |worker2|
                   +-------+</artwork>
        </figure>
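
        <t>A minimal MPI sketch of this main-worker pattern, assuming the
        main node is rank 0 and each process contributes MPI_DOUBLE values
        (both assumptions made for illustration only), could look as
        follows.</t>

        <figure align="center"
                title="Illustrative MPI sketch of main-worker aggregation (example only)">
          <artwork align="left">#include &lt;mpi.h&gt;

#define N 512    /* elements contributed by each process (assumed) */

int main(int argc, char **argv)
{
    int rank;
    double local[N] = {0}, aggregated[N];

    MPI_Init(&amp;argc, &amp;argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &amp;rank);

    /* Every process contributes its local data; only the main node
       (root = 0) receives the element-wise sum over all workers. */
    MPI_Reduce(local, aggregated, N, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}</artwork>
        </figure>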
      </section>

      <section title="Distributed Storage Systems">
        <t>Collective operations such as broadcast are also used in
        distributed storage systems. Primary servers perform data operations
        such as replication and modification and broadcast the resulting
        messages to backup servers. Currently, the broadcast operation is
        implemented with commodity RDMA Network Interface Cards (RNICs)
        using unicast operations, and the performance of RDMA-based data
        replication suffers from two drawbacks of unicast traffic: data
        redundancy and independent replication state. On the one hand, the
        data redundancy across multiple replications wastes a large amount
        of bandwidth on the network links towards the replica servers. On
        the other hand, the independent replication state duplicates CPU
        work: the primary must independently post a request, poll the
        completion, and keep track of the delivery status of each unicast
        replication, which incurs considerable overhead during
        transmission.</t>

        <figure align="center"
                title="Broadcast in distributed storage systems">
          <artwork align="center">              +-------+
              |Primary|
              +---+---+
                  |
                  |    Replica
                  |   broadcast
                  v
               +--+---+
   +-----------+Switch+----------+
   |           +--+---+          |
   |              |              |
   |              |              |
   |              |              |
   v              v              v
+--+---+       +--+---+      +---+--+
|Backup|       |Backup|      |Backup|
+------+       +------+      +------+</artwork>
        </figure>
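
        <t>To make the redundancy concrete, the sketch below (illustrative
        only, and using MPI calls rather than RDMA verbs) contrasts the
        per-backup unicast replication described above with a single
        broadcast; the payload size and the rank numbering are assumptions
        made for this example.</t>

        <figure align="center"
                title="Unicast replication versus broadcast (illustrative sketch)">
          <artwork align="left">#include &lt;mpi.h&gt;
#include &lt;string.h&gt;

#define PAYLOAD 4096    /* replicated block size in bytes (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    char block[PAYLOAD];

    MPI_Init(&amp;argc, &amp;argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &amp;rank);
    MPI_Comm_size(MPI_COMM_WORLD, &amp;size);
    if (rank == 0)
        memset(block, 0x5a, PAYLOAD);    /* primary fills the block */

    /* Unicast replication (current practice): the primary (rank 0)
       posts one send per backup and tracks each delivery, so the
       same payload crosses the primary's link size-1 times. */
    if (rank == 0) {
        for (int i = 1; i &lt; size; i++)
            MPI_Send(block, PAYLOAD, MPI_BYTE, i, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(block, PAYLOAD, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    /* Broadcast replication (desired): a single collective call lets
       the library or the network fan the payload out, removing the
       per-replica redundancy on the primary's link. */
    MPI_Bcast(block, PAYLOAD, MPI_BYTE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}</artwork>
        </figure>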
      </section>

      <section title="Big Data Analysis(MapReduce)">
        <t>Mainstream distributed big data analysis systems such as Spark
        also rely on collective operations. The main communication
        performance bottleneck occurs in the Shuffle phase. The Shuffle
        phase involves data movement across multiple workers; it was
        originally implemented over TCP/IP sockets for distributed
        communication among processes, which yields low performance. The
        Shuffle phase can be accelerated via MPI Java bindings, replacing
        TCP with an MPI-based transport for high-performance data
        movement.</t>

        <figure align="center"
                title="Collective operations in Shuffle phase of big data systems">
          <artwork align="center">                  before
                 shuffle
                   +--+ +--+
                   |c1| |c1|
         +-------+ |c1| |a1| after shuffle
         |worker2| |c1| |b1|
         +-------+ +--+ +--+

             ^                     +--+ +--+
             |                     |a1| |a1|
             |           +-------+ |a1| |c1|
+------+     +---------&gt; |worker1| |a1| |b1|
|Master|                 +-------+ +--+ +--+
+------+
                             ^
+------+                     |
|Driver|                     |
+------+                     |
             +---------------^
             |
             |
             |
             v     +--+ +--+
                   |b1| |b1|
         +-------+ |b1| |a1|
         |worker3| |b1| |c1|
         +-------+ +--+ +--+</artwork>
        </figure>
      </section>
    </section>

    <section title="Problem Statement">
      <t>The demand for computing resources in AI/HPC applications is
      growing rapidly, and single-node computing power cannot meet it.
      Parallel computing has therefore become the norm. Parallel computing
      is a type of computing architecture in which several processors
      simultaneously execute multiple smaller calculations broken down from
      an overall larger, complex problem. Collective communication plays a
      role in data aggregation, data distribution, and synchronization in
      distributed computing systems. The collective communication primitives
      include collective operations such as Reduce, All-Reduce, Bcast,
      Alltoall, Scatter, and Gather, as well as synchronization operations
      such as Barrier.</t>

      <figure title="Collective Communication of Parallel Computing">
        <artwork align="center">+---------------+--------------+---------------------------------+
|     Type      |  Function    |         Description             |
+---------------+--------------+---------------------------------+
|               |  Bcast       | One to group.                   |
|               |              | One process sends (broadcasts)  |
|               |              | some data to all the processes  |
|               |              | in a group.                     |
|               +--------------+---------------------------------+
|               |  Gather      | Group to one.                   |
|               |              | An array is scattered across    |
|               |              | all processes in the group, and |
|               |              | one process (root) collects each|
|               |              | piece of the array into a       |
|               |              | specified array.                |
|               +--------------+---------------------------------+
|      Data     |  Allgather   | All processes, not just the     |
|    Movement   |              | root, receive the result of     |
|               |              | Gather.                         |
|               +--------------+---------------------------------+
|               |  Scatter     | One-To-Group.                   |
|               |              | One process distributes the data|
|               |              | into n segments, where the i-th |
|               |              | segment is sent to the i-th     |
|               |              | process in the group which has  |
|               |              | n processes.                    |
|               +--------------+---------------------------------+
|               |  Alltoall    | This is an extension to         |
|               |              | Allgather. Each process sends   |
|               |              | distinct data to each receiver. |
|               |              | The j-th block from process i is|
|               |              | received by process j and stored|
|               |              | in the i-th block.              |
+---------------+--------------+---------------------------------+
|               |  Reduce      | Group to one.                   |
|               |              | Used to collect data or partial |
|               |              | results from multiple processing|
|               |              | units and to combine them into a|
|               |              | global result by a chosen       |
|               |              | operator.                       |
|               +--------------+---------------------------------+
|               |  All-Reduce  | Distribute the result of a      |
|     Data      |              | Reduce operation to all         |
|               |              | processes in the group.         |
|  Aggregation  +--------------+---------------------------------+
|               |Reduce-Scatter| Scatter the result of a Reduce  |
|               |              | operation across all processes. |
|               +--------------+---------------------------------+
|               |  Scan        | A Scan operation performs       |
|               |              | partial reductions on           |
|               |              | distributed data.               |
+---------------+--------------+---------------------------------+
|Synchronization|  Barrier     | A synchronous operation to      |
|               |              | synchronize all processes       |
|               |              | within a communicator.          |
+---------------+--------------+---------------------------------+</artwork>
      </figure>

      <section title="Collective Message Transport Issues">
        <section title="Reliability">
          <t>Traditional transport-layer protocols such as TCP and QUIC
          provide reliable transmission only for point-to-point
          communication, while multicast and broadcast communication
          receives only best-effort service. Therefore, in order to meet
          the requirement of reliable transmission, implementations of
          collective operations such as Bcast and AlltoAll are often built
          from point-to-point operations at the hosts.</t>

          <t>This wastes host CPU resources on repeated packetization and
          sending, while multiple identical data packets on the same path
          constitute redundant traffic and exacerbate the network load. To
          solve these problems, parallel computing networks need a reliable
          transmission mechanism suitable for collective operations, one
          that cooperates with appropriate traffic engineering and
          congestion control mechanisms.</t>

          <t>Collective communication optimization generally uses a network
          node such as a switch to perform a collective operation such as
          Reduce. The network node receives data from multiple senders and
          computes the result based on the provided operator, e.g. the SUM
          operator. There are two approaches to reliability for ensuring
          the correctness of the collective operations.</t>

          <t>The first approach is to keep using end-to-end reliability.
          The intermediate node that performs the Reduce is not considered
          an endpoint that participates in the traditional reliability
          guarantee. Referring to the figure below, Hosts 1 to 3 are
          workers for a Reduce operation, and Host 4 is the parameter
          server that collects the data from all the workers to perform a
          SUM computation. A transport session is established between each
          worker and the parameter server. The switch in the middle
          performs the in-network aggregation. In case of packet loss
          between any of the workers and the switch or between the switch
          and the parameter server, end-to-end reliability mechanisms such
          as retransmission are triggered by the endpoints. The
          intermediate switch is thus relieved of the burden of maintaining
          transport sessions and most of the associated state. Consider
          scenarios where the data packets are small, as in some high
          performance computing workloads: the switch simply keeps a single
          packet from every worker until it can compute the sum over all
          workers, so the state maintained is minimal. How reliability can
          be efficiently ensured, in terms of fast loss detection and
          signaling, when an intermediate node plays such a role should be
          considered. In addition to the reliability considerations, how to
          ensure that the data is aggregated at the correct intermediate
          node or nodes should also be considered. In encryption scenarios,
          ubiquitous connection encryption (e.g., QUIC) makes it hard to
          employ such performance-enhancing functions.</t>

          <t>The second approach is to treat the intermediate switch as an
          endpoint. Reliability is then guaranteed between worker and
          switch and between switch and parameter server as two independent
          sessions. This requires the switch to support the full transport
          function, including session establishment and all state
          maintenance. A lighter form of transport session maintenance is
          expected to bring more benefit. In addition, there should be
          fallback signaling to tolerate faults. In encryption scenarios,
          if network devices are required to encrypt and decrypt data, they
          will need substantial resources, have to maintain a large number
          of sessions, and face issues such as managing secret keys.</t>

          <figure align="center"
                  title="In-Network Aggregation Packet Termination">
            <artwork align="center">         +-----+
         |Pkt_1|
         +-----+
+------+
|Host_1+---------+
+------+         |
                 |
         +-----+ |                              +-------+
         |Pkt_2| |  +-------------------------+ |Pkt_Agg|
         +-----+ |  |                         | +-------+
+------+         |  |         Switch          |            +------+
|Host_2+---------+--&gt; (In-Network Aggregation)+-----------&gt;|Host_4|
+------+         |  |                         |            +------+
                 |  +-------------------------+
         +-----+ |
         |Pkt_3| |
         +-----+ |
+------+         |
|Host_3+---------+
+------+</artwork>
          </figure>
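
          <t>The following sketch is purely illustrative of the minimal
          per-step state described above for the small-packet case; the
          worker count, payload size, and function names are assumptions,
          not a proposed switch design.</t>

          <figure align="center"
                  title="Illustrative per-step aggregation state for small packets (example only)">
            <artwork align="left">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define NUM_WORKERS 3    /* Host_1 to Host_3 in the figure (assumed) */
#define VALUES      256  /* payload elements per packet (assumed)    */

/* Minimal aggregation state for one in-flight Reduce step: a bitmap
   of workers heard from plus a running sum, matching the "single
   packet from every worker" case described above. */
struct agg_state {
    uint32_t seen;             /* bitmap of workers heard from */
    int32_t  sum[VALUES];      /* running SUM over payloads    */
};

/* Called for each incoming worker packet; returns 1 when the
   aggregate is complete and Pkt_Agg can be forwarded to Host_4. */
static int agg_receive(struct agg_state *st, int worker_id,
                       const int32_t payload[VALUES])
{
    if (st-&gt;seen &amp; (1u &lt;&lt; worker_id))
        return 0;              /* duplicate, e.g. a retransmission */
    for (int i = 0; i &lt; VALUES; i++)
        st-&gt;sum[i] += payload[i];
    st-&gt;seen |= 1u &lt;&lt; worker_id;
    return st-&gt;seen == (1u &lt;&lt; NUM_WORKERS) - 1u;
}

int main(void)
{
    struct agg_state st;
    int32_t pkt[VALUES];

    memset(&amp;st, 0, sizeof(st));
    for (int w = 0; w &lt; NUM_WORKERS; w++) {
        memset(pkt, 0, sizeof(pkt));
        pkt[0] = w + 1;        /* worker w contributes the value w+1 */
        if (agg_receive(&amp;st, w, pkt)) {
            /* st.sum now holds the aggregate; forward as Pkt_Agg. */
        }
    }
    return 0;
}</artwork>
          </figure>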
        </section>

        <section title="Semantic Gap Between Message and Packet">
          <t>Collective operation offloading utilizes network devices to
          perform computing tasks that require only limited computational
          precision but high I/O throughput, thereby optimizing collective
          operations. Collective communication optimization devices not
          only perform packet routing and forwarding, but also need to
          process transport-layer messages. Therefore, the transport layer
          needs to provide a mapping mechanism between packets and messages
          and to handle transport-layer message transmission. Existing
          transport layers do not provide this function.</t>

          <t>Message size impacts the performance of collective operation
          implementations. Packet size usually has an upper bound, for
          example an Ethernet jumbo frame is about 9.6 KB, while message
          size has no such limitation and is determined by the application.
          For network devices, processing large messages incurs a lot of
          overhead: deep buffers are required and the Serialization and
          Deserialization (SerDes) logic comes under pressure. Large
          messages also limit the message sending rate at the host and thus
          lower end-to-end system performance. How to choose an appropriate
          message size and optimize its processing remains an open
          problem.</t>

          <figure align="center"
                  title="In-Network computing devices need to process transport layer messages">
            <artwork align="center" type="ascii-art">       Host         Traditional        In-Network         Host
                  Network Devices   Network Devices
 +-------------+                                    +--------------+
 |             |                                    |              |
 | +---------+ |                                    | +----------+ |
 | |   App   | |                  +--------------+  | |   App    | |
 | +---------+ |                  |              |  | +----------+ |
 |             |                  |              |  |              |
++-------------+------------------+--------------+--+--------------++
||             |             Messages            |  |              ||
|| +---------+ |                  | +----------+ |  | +----------+ ||
|| |Transport| |                  | | Message  | |  | | Transport| ||
|| |  Layer  | |                  | |Processing| |  | |  Layer   | ||
|| +---------+ |                  | +----------+ |  | +----------+ ||
||             |                  |              |  |              ||
++-------------+------------------+--------------+--+--------------++
 |             |                  |              |  |              |
 |             | +--------------+ |              |  |              |
++-------------+-+--------------+-+--------------+--+--------------++
||             | |              | |              |  |              ||
|| +---------+ | |              | |              |  | +----------+ ||
|| | Network | | |            Packets            |  | |  Network | ||
|| |  Layer  | | |              | |              |  | |   Layer  | ||
|| |  (IB,   | | |              | |              |  | |   (IB,   | ||
|| | TCP/IP) | | | +----------+ | | +----------+ |  | |  TCP/IP) | ||
|| +---------+ | | |          | | | |          | |  | +----------+ ||
||             | | |          | | | |          | |  |              ||
|| +---------+ | | |  Packet  | | | |  Packet  | |  | +----------+ ||
|| |  Link   | | | |Forwarding| | | |Forwarding| |  | |   Link   | ||
|| |  Layer  | | | |          | | | |          | |  | |   Layer  | ||
|| |(IB Link,| | | |          | | | |          | |  | | (IB Link,| ||
|| |   Eth)  | | | +----------+ | | +----------+ |  | |    Eth)  | ||
|| +---------+ | |              | |              |  | +----------+ ||
||             | |              | |              |  |              ||
++-------------+-+--------------+-+--------------+--+--------------++
 +---------+---+ +-----+----+---+ +----+----+----+  +-----+--------+
           |           ^    |          ^    |             ^
           +-----------+    +----------+    +-------------+</artwork>
          </figure>
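
          <t>As an illustration of the message-to-packet mapping discussed
          above, the following sketch shows one hypothetical way a host
          could segment a message into bounded-size packets that an
          in-network device can associate back to the message; all field
          names and sizes are assumptions, not a proposed wire format.</t>

          <figure align="center"
                  title="Hypothetical message-to-packet mapping (illustrative sketch)">
            <artwork align="left">#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

#define MAX_PKT_PAYLOAD 8192  /* per-packet payload bound (assumed) */

/* Hypothetical per-packet header that lets an in-network device map
   packets back onto application-level messages. */
struct cco_pkt_hdr {
    uint32_t msg_id;       /* identifies the collective message      */
    uint32_t pkt_seq;      /* position of this packet in the message */
    uint32_t pkt_total;    /* number of packets in the message       */
    uint16_t payload_len;  /* bytes of message data in this packet   */
};

/* Number of packets needed to carry a message of msg_len bytes. */
static uint32_t packets_for_message(uint64_t msg_len)
{
    return (uint32_t)((msg_len + MAX_PKT_PAYLOAD - 1)
                      / MAX_PKT_PAYLOAD);
}

int main(void)
{
    /* A 1 MiB application message maps to 128 packets here; larger
       messages mean more per-message state in the network device. */
    printf("packets: %u\n",
           (unsigned)packets_for_message(1024 * 1024));
    return 0;
}</artwork>
          </figure>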
        </section>

        <section title="Blocking and Non-blocking Communications">
          <t>Parallel computing communication operators can be divided into
          two categories based on whether the caller waits for the
          communication result before proceeding with the next computation
          step: blocking and non-blocking. Blocking collective operations
          block the calling process until communication has completed.
          Non-blocking operations allow communication to be handled in the
          background, so that communication and computation can overlap.
          Parallel computing networks need to support both types of
          communication simultaneously, and for blocking communication it
          is even more important to use techniques such as in-network
          computing to optimize collective communication performance.</t>
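
          <t>The difference can be illustrated with the following MPI
          sketch (illustrative only): the blocking call returns only after
          the collective has completed, while the non-blocking variant lets
          independent local computation overlap with communication. The
          buffer sizes and the compute step are assumptions made for this
          example.</t>

          <figure align="center"
                  title="Blocking versus non-blocking Allreduce (illustrative sketch)">
            <artwork align="left">#include &lt;mpi.h&gt;

#define N 1024   /* elements reduced per call (assumed) */

/* Stand-in for useful computation that is independent of the
   ongoing Allreduce. */
static void local_compute(double *x, int n)
{
    for (int i = 0; i &lt; n; i++)
        x[i] = x[i] * 0.5 + 1.0;
}

int main(int argc, char **argv)
{
    double a[N] = {0}, r1[N], r2[N], other[N] = {0};
    MPI_Request req;

    MPI_Init(&amp;argc, &amp;argv);

    /* Blocking: the calling process is blocked until the reduced
       result is available in r1. */
    MPI_Allreduce(a, r1, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Non-blocking: communication proceeds in the background while
       independent computation overlaps with it. */
    MPI_Iallreduce(a, r2, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD,
                   &amp;req);
    local_compute(other, N);
    MPI_Wait(&amp;req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}</artwork>
          </figure>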
        </section>
      </section>

      <section title="Control and Management Plane Issues">
        <section title="In-Network Computing Primitives">
          <t>Once collective operation offloading is supported, the
          collective communication library needs to be expanded to support
          not only the end-to-end mode but also end-to-network
          communication modes, which means supporting in-network
          processing/computing functions.</t>

          <t>The implementation of the offloaded collective operations
          relies on the capabilities provided by physical network devices,
          known as in-network computing primitives. Currently, most
          in-network computing deployments use programmable network devices
          based on languages such as P4 (Programming Protocol-independent
          Packet Processors) and NPL (Network Programming Language). These
          programmable switches are not powerful enough to program
          non-trivial application-layer behavior, and their platforms are
          too heterogeneous to manage. We need to standardize the
          in-network computing primitives that network devices can provide,
          including data types, operations, etc. Based on the standardized
          in-network computing primitives, parallel in-network computing
          communication operators can then be implemented.</t>
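
          <t>For illustration only, a standardized primitive might be
          described by a small capability record such as the sketch below;
          the enumerations and field names are hypothetical and merely
          indicate the kind of information (data types, operations, limits)
          that would need to be agreed upon.</t>

          <figure align="center"
                  title="Hypothetical in-network computing primitive descriptor">
            <artwork align="left">#include &lt;stdint.h&gt;

/* Hypothetical enumerations of the data types and operations an
   in-network computing device could advertise. */
enum cco_dtype { CCO_INT32, CCO_FLOAT16, CCO_FLOAT32 };
enum cco_op    { CCO_SUM, CCO_MIN, CCO_MAX, CCO_BCAST };

/* Hypothetical capability record a device could expose to the
   control and management plane. */
struct cco_primitive {
    enum cco_op    op;           /* offloaded collective operation */
    enum cco_dtype dtype;        /* element type supported         */
    uint32_t       max_elements; /* per-message element limit      */
    uint32_t       max_groups;   /* concurrent collective groups   */
};

int main(void)
{
    struct cco_primitive p = { CCO_SUM, CCO_FLOAT32, 65536, 128 };
    (void)p;   /* would be advertised via the management plane */
    return 0;
}</artwork>
          </figure>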
        </section>

        <section title="Topology Awareness">
          <t>In data centers, fat-tree or Clos topologies are widely used.
          Some fully meshed network topologies exist as well.</t>

          <t>As the scale of parallel computing clusters continues to
          increase, it is difficult to achieve optimal transmission
          performance solely through the automatic convergence of
          traditional routing protocols. It is necessary to employ
          topology-aware algorithms to explore complex topologies and to
          complete path planning, traffic optimization, etc. However,
          existing topology-aware algorithms do not consider collective
          operation offloading scenarios and need to be modified to realize
          end-network cooperation, meeting the requirements of end-to-end
          path planning and traffic optimization.</t>
        </section>
      </section>

      <section title="One to Group Transmission Issues">
        <t>IP multicast has been designed to support broadcast-oriented
        applications such as live streaming and video conferencing. There
        are several standardized IP multicast protocols, such as PIM <xref
        target="RFC7761">RFC 7761</xref> and BIER <xref target="RFC8279">RFC
        8279</xref>. However, existing IP multicast protocols have not been
        fully exploited by distributed application systems that require
        collective communication. Most collective operations still use
        unicast transmission, which incurs a lot of overhead in the form of
        information redundancy, bandwidth occupancy, and host CPU
        consumption. What is needed is low-latency multi-destination
        delivery, potentially with reliability or at least failure
        detection. IP multicast is one potential solution (perhaps not an
        ideal one), but it is still worth exploring. Even though existing IP
        multicast protocols have their best-suited application scenarios and
        differ in state maintenance and multicast tree construction, they
        are still well worth applying to collective communication scenarios.
        Collective operations such as Bcast and AlltoAll can be augmented by
        extending IP multicast protocols, as can composite operations such
        as Reduce-Scatter and Allgather.</t>
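
        <t>As a minimal illustration of one-to-group delivery over IP
        multicast (best effort only, without the reliability discussed
        above), a receiver joins a multicast group and a sender transmits a
        single datagram that reaches all members; the group address and UDP
        port below are assumptions made for this example.</t>

        <figure align="center"
                title="One-to-group delivery over IP multicast (illustrative sketch)">
          <artwork align="left">#include &lt;arpa/inet.h&gt;
#include &lt;netinet/in.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;unistd.h&gt;

#define GROUP "239.1.1.1"   /* example multicast group (assumed) */
#define PORT  5000          /* example UDP port (assumed)        */

int main(void)
{
    /* Receiver side: bind and join the multicast group. */
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&amp;addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    bind(rx, (struct sockaddr *)&amp;addr, sizeof(addr));

    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr(GROUP);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(rx, IPPROTO_IP, IP_ADD_MEMBERSHIP, &amp;mreq,
               sizeof(mreq));

    /* Sender side: one datagram reaches every joined receiver,
       instead of one unicast send per destination. */
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in grp;
    memset(&amp;grp, 0, sizeof(grp));
    grp.sin_family = AF_INET;
    grp.sin_addr.s_addr = inet_addr(GROUP);
    grp.sin_port = htons(PORT);
    sendto(tx, "bcast payload", 13, 0,
           (struct sockaddr *)&amp;grp, sizeof(grp));

    close(rx);
    close(tx);
    return 0;
}</artwork>
        </figure>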
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>Collective communication optimization may introduce some security and
      privacy concerns, especially when collective operations offloading is
      needed.</t>

      <t>On one hand, the distributed nature of the computations and the
      involvement of network devices raise issues of data confidentiality,
      integrity, and authentication. There are potential vulnerabilities
      when data processed in the network is exposed to unauthorized access.
      It is suggested to support both security-enabled and security-less
      deployments, so that limited domains <xref target="RFC8799"/> do not
      have to pay the penalty of expensive cryptographic or authorization
      operations; within a limited domain, the application and the network
      can mutually trust each other, since they may both belong to the same
      administrator. Extending the technology to the Internet should be
      designed together with intrinsic protective measures.</t>

      <t>On the other hand, encrypted data brings challenges and performance
      issues for processing on network devices. Decrypting and encrypting
      data on network devices is not only inefficient, but also involves
      issues such as key management and authorization. Processing encrypted
      data may not be possible for all encryption algorithms and is not
      suitable for all scenarios.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.7761"?>

      <?rfc include="reference.RFC.8279"?>

      <?rfc include="reference.RFC.8799"?>
    </references>

    <references title="Informative References">
      <reference anchor="SHArP">
        <front>
          <title>Scalable Hierarchical Aggregation Protocol (SHArP): A
          Hardware Architecture for Efficient Data Reduction</title>

          <author fullname="Richard L. Graham" surname="Graham">
            <organization>Mellanox Technologies, Inc.</organization>
          </author>

          <date year="2023"/>
        </front>

        <seriesInfo name="DOI" value="10.1109/COMHPC.2016.006"/>
      </reference>

      <reference anchor="Hardware_Multicast">
        <front>
          <title>Fast and Scalable MPI-Level Broadcast using InfiniBand's
          Hardware Multicast Support</title>

          <author fullname="Jiuxing Liu" surname="Liu">
            <organization>Computer and Information Science The Ohio State
            University</organization>
          </author>

          <date year="2004"/>
        </front>

        <seriesInfo name="DOI" value="10.1109/ipdps.2004.1302912 "/>
      </reference>
    </references>
  </back>
</rfc>
