<?xml version="1.0" encoding="utf-8"?>
<?xml-model href="rfc7991bis.rnc"?>  <!-- Required for schema validation and schema-aware editing -->


<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>

<rfc
  xmlns:xi="http://www.w3.org/2001/XInclude"
  category="info"
  docName="draft-hhz-fantel-sar-wan-00"
  ipr="trust200902"
  obsoletes=""
  updates=""
  submissionType="IETF"
  xml:lang="en"
  version="3">

  <front>
    <title abbrev="FANTEL Scenarios and Requirements in WAN">FANTEL Scenarios and Requirements in Wide Area Networks</title>

    <seriesInfo name="Internet-Draft" value="draft-hhz-fantel-sar-wan-00"/>

    <author fullname="Jiayuan Hu" initials="J" surname="Hu">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangdong</region>
          <code>510000</code>
          <country>CN</country>
        </postal>
        <email>hujy5@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Zehua Hu" initials="Z" surname="Hu">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangdong</region>
          <code>510000</code>
          <country>CN</country>
        </postal>
        <email>huzh2@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Yongqing Zhu" initials="Y" surname="Zhu">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangdong</region>
          <code>510000</code>
          <country>CN</country>
        </postal>
        <email>zhuyq8@chinatelecom.cn</email>
      </address>
    </author>

    <date year="2025"/>

    <area>Routing</area>
    <workgroup>FANTEL</workgroup>

    <keyword>FANTEL</keyword>
    <keyword>fast notification</keyword>
    <keyword>traffic engineering</keyword>
    <keyword>load balancing</keyword>
    <keyword>WAN</keyword>

    <abstract>
      <t>
        This document introduces the main scenarios related to AI services in Wide Area Networks (WANs),
        as well as the requirements these scenarios place on FANTEL (Fast Notification for Traffic
        Engineering and Load Balancing). Traditional network management mechanisms are often constrained
        by slow feedback and high overhead, which limits their ability to react quickly to sudden link
        failures, congestion, or load imbalance. These new AI services therefore need FANTEL to provide
        real-time, proactive notifications for traffic engineering and load balancing, meeting their
        ultra-high throughput and lossless data transmission requirements.
      </t>
    </abstract>

  </front>

  <middle>

    <section>
      <name>Introduction</name>
      <t>
        The rapid development of Large Language Models (LLMs) necessitates substantial computing power.
        Hyperscalers build their own AI Data Centers (AIDCs) to train foundational models. Most
        enterprises, however, have no demand for training foundational models; they want to fine-tune
        and run inference cost-effectively. A practical solution is to rent third-party AIDCs for LLM
        fine-tuning and inference, which requires the IP network to fulfill their needs. The IP network
        consists of the IP Backbone and IP Metropolitan Area Networks (IP MANs). An IP MAN interconnects
        customers and data centers (including AIDCs) within a metropolitan area, while the IP Backbone
        interconnects IP MANs and data centers (including AIDCs). Together, the IP Backbone and IP MANs
        form the IP Wide Area Network (IP WAN, or WAN for short).
      </t>
      <t>
        AI services in the WAN, including sample data transmission and coordinated model training and
        inference, require networks to manage traffic efficiently and adapt rapidly to network changes.
        <xref target="draft-geng-fantel-fantel-requirements"/> points out that existing network
        management mechanisms such as FRR, BFD, and ECN often rely on delayed feedback or reactive
        responses, resulting in degraded network performance, longer service disruptions, or inefficient
        resource utilization. FANTEL is therefore proposed to deliver real-time, reliable notifications
        of network events, effectively supporting Traffic Engineering (TE) functions such as load
        balancing, failure protection, and congestion control. The WAN needs to deploy FANTEL to ensure
        high-throughput, lossless data transmission, meeting the new demands of AI services.
      </t>
    </section>

    <section>
      <name>Conventions Used in This Document</name>
      <section>
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
          "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT
          RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
          interpreted as described in BCP 14 <xref target="RFC2119"/>
          <xref target="RFC8174"/> when, and only when, they appear in
          all capitals, as shown here.</t>
      </section>

      <section>
        <name>Abbreviations</name>
        <dl spacing="compact">
          <dt>AIDC:</dt><dd>AI Data Center</dd>
          <dt>BFD:</dt><dd>Bidirectional Forwarding Detection</dd>
          <dt>ECMP:</dt><dd>Equal Cost Multi-Path</dd>
          <dt>ECN:</dt><dd>Explicit Congestion Notification</dd>
          <dt>FANTEL:</dt><dd>Fast Notification for Traffic Engineering and Load Balancing</dd>
          <dt>FRR:</dt><dd>Fast Reroute</dd>
          <dt>HPCC:</dt><dd>High Precision Congestion Control</dd>
          <dt>INT:</dt><dd>Inband Network Telemetry</dd>
          <dt>IOAM:</dt><dd>In-situ Operations, Administration, and Maintenance</dd>
          <dt>LLM:</dt><dd>Large Language Model</dd>
          <dt>MAN:</dt><dd>Metropolitan Area Network</dd>
          <dt>RDMA:</dt><dd>Remote Direct Memory Access</dd>
          <dt>RTT:</dt><dd>Round-Trip Time</dd>
          <dt>TE:</dt><dd>Traffic Engineering</dd>
          <dt>WAN:</dt><dd>Wide Area Network</dd>
        </dl>
      </section>
    </section>

    <section>
      <name>Use Cases</name>
      <t>
        For most customers, the cost of owning and maintaining AI facilities is prohibitively high. A
        convenient solution is to rent AI facilities located in third-party AIDCs to fulfill their LLM
        training or inference requirements. Under these circumstances, customers access AIDCs via the
        WAN to obtain various AI services, as shown in Figure 1: sample data transmission, coordinated
        model training across AIDCs, and coordinated model inference between the customer and AIDCs.
      </t>
      <figure>
          <name>AI service scenarios in WAN</name>
        <artwork align="center"><![CDATA[
                             S2: Coordinated
                             model training
               +--------+ <------------------> +--------+
               |  AIDC  |----------------------|  AIDC  |
               +--------+                      +--------+
                    ^ |                           | ^
        S1: Sample  | |    Wide Area Network      | |   S3: Coordinated model
            data    | |                           | |   inference
      transmission  | |                           | |   
                    v |                           | v
               +-------------+           +--------------+
               |   Customer  |           |   Customer   |
               +-------------+           +--------------+
        ]]></artwork>
      </figure>
      <section>
        <name>Scenario 1: Sample data transmission</name>
        <t>
          When customers train AI models in third-party AIDCs, they need to transmit massive sample
          data into the AIDCs. Because customers have differing data security requirements, two
          sub-scenarios are involved.
        </t>
        <section>
          <name>Sub-scenario 1.1: Transmitting sample data into storage system</name>
          <t>
            The training of LLMs requires rounds of fine-tuning for performance improvement, with each
            round consuming massive sample data (ranging from terabytes to petabytes). Customers
            therefore need to upload sample data into AIDCs as quickly as possible to start training.
            To provide high-efficiency, cost-effective sample data transmission, the WAN needs to meet
            the following requirements:
          </t>
          <t>
            1. Customers usually obtain fixed bandwidth through dedicated line services to meet their
            service needs. However, their sample data transmission demands are intermittent and need to
            complete as quickly as possible, so the WAN must be able to flexibly adjust bandwidth per
            task.
          </t>
          <t>
            2. To maximize the bandwidth available for data transmission, this scenario requires the WAN
            to support fast notification of network status changes between devices, enabling real-time
            service bandwidth adjustment and efficient hourly transmission of terabyte-scale sample data.
          </t>
          <t>
            3. During high-speed sample data transmission, network failures can cause a large number of
            packet losses, sharply reducing data transmission efficiency. This scenario requires the WAN
            to provide millisecond-level failure protection, enabling rapid failure detection and
            failover.
          </t>
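          <t>
            To put requirement 2 in concrete terms (the figures below are illustrative assumptions, not
            taken from any specification), sustaining a one-terabyte-per-hour transfer already requires
            more than 2 Gbit/s end to end, and any stall caused by congestion or failure raises the peak
            rate needed to finish on time:
          </t>
          <sourcecode type="python"><![CDATA[
def required_gbps(data_bytes: float, seconds: float) -> float:
    """Sustained rate in Gbit/s needed to move data_bytes within seconds."""
    return data_bytes * 8 / seconds / 1e9

# Moving 1 TB (1e12 bytes) within one hour needs ~2.22 Gbit/s sustained;
# halving the available time doubles the required rate.
rate = required_gbps(1e12, 3600)
]]></sourcecode>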
        </section>

        <section>
          <name>Sub-scenario 1.2: Directly transmitting sample data to AI servers</name>
          <t>
            Certain customers with stringent data security requirements prohibit storing sample data
            outside their own facilities. To address this, sample data must be uploaded directly to the
            AI training servers over RDMA protocols while those servers are performing training tasks.
            Current mainstream RDMA protocols rely on a Go-Back-N retransmission mechanism, making them
            highly sensitive to latency and packet loss (even a 0.1% packet loss rate can degrade
            computational efficiency by 50%).
          </t>
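          <t>
            The quoted sensitivity can be sketched with a first-order model (an illustrative
            approximation, not taken from any RDMA specification): under Go-Back-N, one lost packet
            forces retransmission of roughly the whole in-flight window, so the bytes on the wire grow
            quickly with the loss rate.
          </t>
          <sourcecode type="python"><![CDATA[
def gbn_goodput_fraction(loss_rate: float, window_pkts: int) -> float:
    """First-order Go-Back-N goodput estimate.

    Delivering one packet costs on average 1 + loss_rate * window_pkts
    transmissions, because each loss replays ~window_pkts packets.
    """
    return 1.0 / (1.0 + loss_rate * window_pkts)

# With a 1000-packet in-flight window (plausible for a high
# bandwidth-delay-product WAN path), a 0.1% loss rate halves goodput.
half = gbn_goodput_fraction(0.001, 1000)  # -> 0.5
]]></sourcecode>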
          <t>
            To achieve lossless transmission of sample data, in addition to the requirements of
            sub-scenario 1.1, the WAN needs to further support millisecond-level congestion control to
            meet the zero-packet-loss requirement of RDMA transmission. This scenario therefore requires
            a FANTEL mechanism that quickly notifies upstream devices to reduce their sending rates
            based on router buffer occupancy.
          </t>
        </section>
      </section>

      <section>
        <name>Scenario 2: Coordinated model training</name>
        <t>
          The scaling laws demonstrate that LLM performance scales with model size, sample dataset size,
          and the amount of computing power used for training. The computing power demand of LLMs grows
          rapidly (it is estimated that GPT-6 will require ZFLOPS-scale computing power, a roughly 2000x
          increase over GPT-4). Customers' model training requirements consist of foundational model
          training and fine-tuning. The computing power of a single AIDC is limited by physical
          infrastructure (e.g., space and power supply), making it inadequate for ultra-scale LLM
          training. A solution is therefore proposed to fulfill the computing power demand of
          ultra-scale LLM training through the efficient coordination of distributed computing resources
          across multiple AIDCs. In addition, there are often residual computing resources that are
          insufficient for a single customer's demand; these resources can be coordinated across AIDCs
          to serve more customers. For customers with high data security requirements and self-built
          DCs, collaborative training between the customer's own DC and AIDCs can achieve ultra-scale
          LLM training while ensuring that sample data does not leave the customer's DC (this requires
          the input and output layers of the LLM to be deployed within the customer's DC).
        </t>
        <t>
          In this scenario, the LLM training task is split across multiple AIDCs based on
          parallelization strategies such as pipeline parallelism and data parallelism. During model
          training, the parameters of the LLM need to be synchronized among AIDCs. The parameter-plane
          synchronization traffic is transmitted via RDMA and typically features elephant flows. The WAN
          should therefore provide efficient, lossless transmission of parameter-plane data and needs to
          meet the following requirements:
        </t>
        <t>
          1. Elephant flows are long-lived and carry large volumes of data, so they can easily cause
          network congestion, while parameter-plane synchronization requires low latency and zero packet
          loss. This scenario requires the WAN to provide millisecond-level congestion control, rapidly
          notifying upstream devices to slow their sending rates upon detecting impending congestion. In
          addition, because traditional load balancing often relies on static policies, the WAN needs
          fast-reacting load balancing that immediately adjusts decisions in response to network
          changes, ensuring optimal resource utilization and performance.
        </t>
        <t>
          2. Interruption of parameter-plane synchronization due to a network failure may force a
          rollback to the last checkpoint, wasting computing power and sharply reducing computational
          efficiency <xref target="draft-cheng-rtgwg-ai-network-reliability-problem"/>. This scenario
          requires the WAN to implement millisecond-level failure protection with rapid failure
          detection and failover.
        </t>
      </section>

      <section>
        <name>Scenario 3: Coordinated model inference</name>
        <t>
          Many customers have deployed AI servers in their own DCs to support LLM inference applications. However, the
          high deployment cost and operational complexity of on-premises deployment limit the scale of computing power.
          Due to the increasing inference concurrency, this on-premises deployment method cannot meet the computing
          power demand. To address this, the collaboration model inference between customer and AIDCs presents a
          more efficient, agile, and cost-effective approach to realize elastic computing power scaling.
        </t>
        <t>
          In this scenario, the LLM inference task is split between the customer and AIDCs based on
          parallelization strategies such as pipeline parallelism and expert parallelism. Taking LLM
          inference based on the Prefill-Decode disaggregation architecture as an example, the input and
          output layers of the prefill and decode stages are placed in the customer's DC, while the
          other layers are placed in the AIDC. This utilizes the computing resources of AIDCs to handle
          larger-scale inference concurrency while ensuring that data does not leave the customer's DC.
        </t>
        <t>
          During model inference, parameter synchronization between the customer's DC and the AIDCs is
          transmitted via RDMA. As in Scenario 2, this scenario requires the WAN to provide real-time
          elephant-flow load balancing, millisecond-level congestion control, and fast network failure
          protection.
        </t>
      </section>
    </section>

    <section>
      <name>Problem Statement</name>
      <t>
        In the AI scenarios described above, the primary challenge for the WAN is real-time traffic
        engineering and load balancing. Current traffic engineering mechanisms struggle to provide
        low-latency, low-overhead solutions that meet these requirements, exhibiting the following
        issues:
      </t>
      <t>
        1. Current load balancing techniques face great challenges in highly dynamic environments. A
        core issue is the lack of timely awareness of, and adaptive response to, network state changes.
        Traditional mechanisms often rely on periodic global state synchronization or static policies,
        which results in delayed decision-making. Current controller-based load balancing uses In-situ
        OAM (IOAM) to obtain network status information. IOAM provides visibility into traffic by
        embedding telemetry data directly in packets; however, IOAM data is extracted and reported to a
        controller by the device CPU, which adds latency and limits responsiveness
        <xref target="draft-geng-fantel-fantel-gap-analysis"/>. Moreover, controllers typically process
        telemetry in software, further delaying decisions. The resulting control loop is typically at
        second scale, which inevitably leads to network congestion and severe packet loss.
      </t>
      <t>
        2. Existing flow control mechanisms rely on delayed feedback or reactive responses, which can
        lead to suboptimal network performance in high-latency, long-RTT environments such as WANs. TCP
        congestion control is receiver-driven: it uses feedback signals from the receiver to adjust the
        sender's transmission rate, and these signals are subject to RTT delays, which is especially
        problematic in high-speed dynamic environments. ECN <xref target="RFC3168"/> marks packets to
        indicate congestion, but it relies on end-to-end signaling and lacks precise real-time feedback.
        INT provides path-level telemetry by inserting metadata at each hop, which is returned to the
        sender via the ACK; congestion control algorithms such as High Precision Congestion Control
        (HPCC) use INT for precise load awareness, but INT-based telemetry still incurs an RTT of delay
        before the sender receives feedback, limiting its response capability. These end-to-end
        signaling mechanisms introduce tens of milliseconds of latency in a large-scale WAN, failing to
        meet the requirements for lossless data transmission.
      </t>
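      <t>
        The feedback-delay gap can be quantified with a back-of-the-envelope calculation (the distances
        and the fiber propagation constant below are illustrative assumptions): end-to-end feedback
        costs at least one round trip over the whole path, whereas a hop-by-hop notification only has to
        cross the link to the upstream neighbor.
      </t>
      <sourcecode type="python"><![CDATA[
US_PER_KM = 5.0  # ~5 microseconds per km of fiber (c divided by refractive index ~1.5)

def e2e_feedback_ms(path_km: float) -> float:
    """Earliest end-to-end congestion feedback: one full round trip."""
    return 2 * path_km * US_PER_KM / 1000

def hop_notification_ms(hop_km: float) -> float:
    """A hop-by-hop notification only crosses the upstream link once."""
    return hop_km * US_PER_KM / 1000

# On a 2000 km WAN path the sender learns of congestion after >= 20 ms,
# while notifying a neighbor 100 km upstream takes ~0.5 ms.
e2e = e2e_feedback_ms(2000)     # -> 20.0
hop = hop_notification_ms(100)  # -> 0.5
]]></sourcecode>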
      <t>
        3. Existing failure protection mechanisms such as BFD <xref target="RFC5880"/> and FRR
        <xref target="RFC7490"/> are widely deployed, but both have limitations in speed and scope. BFD
        performs rapid failure detection by exchanging frequent control packets between peers, but a
        high probe frequency not only increases CPU and bandwidth usage but also strains the control
        plane in large-scale networks. Furthermore, a 50 ms detection cycle makes it difficult for BFD
        to meet the link failure detection requirements of some large-scale networks. Recovery by
        routing convergence depends on routing protocol convergence, which may take hundreds of
        milliseconds. FRR complements routing convergence, achieving millisecond-level failover through
        pre-computed backup paths; however, because it protects only against adjacent failures, FRR
        lacks flexibility and responsiveness in complex topologies, with recovery latency reaching tens
        of milliseconds. In general, traditional failure protection mechanisms rely on periodic failure
        detection and centralized rerouting, resulting in recovery times that are not fast enough.
      </t>
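      <t>
        For context, BFD's detection time follows directly from its negotiated parameters: a session
        declares failure only after the detection multiplier times the agreed transmit interval elapses
        without receiving a packet <xref target="RFC5880"/>. The interval values below are illustrative:
      </t>
      <sourcecode type="python"><![CDATA[
def bfd_detection_time_ms(tx_interval_ms: float, detect_mult: int) -> float:
    """Simplified BFD asynchronous-mode detection time: detect_mult
    consecutive intervals must pass with no BFD packet received."""
    return tx_interval_ms * detect_mult

# A common 50 ms interval with the usual multiplier of 3 means ~150 ms
# before a failure is declared -- far from millisecond-level protection.
detect = bfd_detection_time_ms(50, 3)  # -> 150
]]></sourcecode>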
    </section>

    <section>
      <name>Requirements</name>
        <t>
          To solve the above-mentioned problems, FANTEL is needed to provide real-time, rapid
          notification of network events to relevant network nodes, including:
        </t>
        <t>
          1. Fast network status notification. FANTEL uses traffic state detection to monitor traffic
          patterns, link utilization, and node load, triggering notifications on significant deviations
          <xref target="draft-geng-fantel-fantel-requirements"/>. Based on the notified link status,
          nodes can adjust paths and traffic rates in real time, achieving efficient traffic engineering
          and load balancing.
        </t>
        <t>
          2. Fast congestion notification. FANTEL provides a fast, low-latency notification mechanism
          that can detect congestion events and alert network devices in real time. When congestion
          occurs, nodes can adjust their data transmission rates and re-route traffic based on FANTEL,
          preventing packet loss.
        </t>
        <t>
          3. Fast failure protection. FANTEL uses fast failure detection and notification to monitor
          link and node status in real time. When a failure occurs, a node with protection mechanisms
          can immediately switch to backup paths, reroute traffic, or suppress affected routes, ensuring
          service reliability.
        </t>
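        <t>
          The three notification types above could share a common dispatch skeleton on a receiving
          node. The sketch below is purely illustrative: FANTEL does not define these message names,
          fields, or handler actions.
        </t>
        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Notification:
    """Hypothetical fast-notification event; field names are assumptions."""
    kind: str            # "status" | "congestion" | "failure"
    link: str            # affected link, e.g. "R1-R2"
    detail: dict = field(default_factory=dict)

class NotificationDispatcher:
    """Routes each incoming notification to the matching local reaction."""
    def __init__(self) -> None:
        self.handlers: Dict[str, Callable[[Notification], str]] = {
            "status": lambda n: f"rebalance traffic around {n.link}",
            "congestion": lambda n: f"slow senders upstream of {n.link}",
            "failure": lambda n: f"switch to backup path avoiding {n.link}",
        }

    def handle(self, n: Notification) -> str:
        return self.handlers[n.kind](n)
]]></sourcecode>
        <t>
          In practice such reactions would run in the forwarding plane rather than in controller
          software, since the point of FANTEL is to avoid slow control loops.
        </t>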
        <t>
          In summary, FANTEL provides a real-time notification mechanism for the WAN, enabling
          efficient bandwidth utilization, lossless transmission, and fast failover across these AI
          scenarios.
        </t>
    </section>

    <section anchor="IANA">
      <name>IANA Considerations</name>
      <t>TBC</t>
    </section>

    <section anchor="Security">
      <name>Security Considerations</name>
      <t>TBC</t>
    </section>

  </middle>

  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>

        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>

      </references>
      <references>
        <name>Informative References</name>

        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3168.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5880.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7490.xml"/>

        <reference anchor="draft-geng-fantel-fantel-requirements">
          <front>
            <title>Requirements of Fast Notification for Traffic Engineering and Load Balancing</title>
            <author/>
          </front>
          <seriesInfo name="Internet-Draft" value="draft-geng-fantel-fantel-requirements"/>
        </reference>

        <reference anchor="draft-geng-fantel-fantel-gap-analysis">
          <front>
            <title>Gap Analysis of Fast Notification for Traffic Engineering and Load Balancing</title>
            <author/>
          </front>
          <seriesInfo name="Internet-Draft" value="draft-geng-fantel-fantel-gap-analysis"/>
        </reference>

        <reference anchor="draft-cheng-rtgwg-ai-network-reliability-problem">
          <front>
            <title>AI Network Reliability Problem Statement</title>
            <author/>
          </front>
          <seriesInfo name="Internet-Draft" value="draft-cheng-rtgwg-ai-network-reliability-problem"/>
        </reference>

      </references>
    </references>

    <section anchor="Contributors" numbered="false">
      <name>Contributors</name>
      <t>Thanks to all the contributors.</t>
    </section>

 </back>
</rfc>
