<?xml version="1.0" encoding="utf-8"?>
<?xml-model href="rfc7991bis.rnc"?>  <!-- Required for schema validation and schema-aware editing -->
<!-- <?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?> -->
<!-- This third-party XSLT can be enabled for direct transformations in XML processors, including
most browsers -->
<!DOCTYPE rfc [
  <!ENTITY nbsp "&#160;">
  <!ENTITY zwsp "&#8203;">
  <!ENTITY nbhy "&#8209;">
  <!ENTITY wj "&#8288;">
]>
<!-- If further character entities are required then they should be added to the DOCTYPE above.
     Use of an external entity file is not recommended. -->

<rfc
  xmlns:xi="http://www.w3.org/2001/XInclude"
  category="info"
  docName="draft-hhz-fantel-sar-wan-01"
  ipr="trust200902"
  obsoletes=""
  updates=""
  submissionType="IETF"
  xml:lang="en"
  version="3">

  <front>
    <title abbrev="FANTEL Use Cases in WAN">FANTEL Use Cases and Requirements in Wide Area
      Networks</title>

    <seriesInfo name="Internet-Draft" value="draft-hhz-fantel-sar-wan-01" />

    <author fullname="Fan Zhang" initials="F" surname="Zhang">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangdong</region>
          <code>510000</code>
          <country>CN</country>
        </postal>
        <email>zhangf52@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Jiayuan Hu" initials="J" surname="Hu">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangdong</region>
          <code>510000</code>
          <country>CN</country>
        </postal>
        <email>hujy5@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Zehua Hu" initials="Z" surname="Hu">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangdong</region>
          <code>510000</code>
          <country>CN</country>
        </postal>
        <email>huzh2@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Yongqing Zhu" initials="Y" surname="Zhu">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangdong</region>
          <code>510000</code>
          <country>CN</country>
        </postal>
        <email>zhuyq8@chinatelecom.cn</email>
      </address>
    </author>

    <date year="2025" />

    <area>Routing</area>
    <workgroup>FANTEL</workgroup>

    <keyword>FANTEL</keyword>
    <keyword>AI</keyword>
    <keyword>WAN</keyword>
    <keyword>traffic engineering</keyword>
    <keyword>load balancing</keyword>

    <abstract>
      <t>
        This document introduces the main scenarios related to AI services in the WAN, as well as
        their requirements for FANTEL (FAst Notification for Traffic Engineering and Load balancing)
        in these scenarios. Traditional network management mechanisms are often constrained by slow
        feedback and high overhead, which limits their ability to react quickly to sudden link
        failures, congestion, or load imbalances. These AI services therefore need FANTEL to provide
        real-time, proactive notifications for traffic engineering and load balancing, meeting their
        requirements for ultra-high throughput and lossless data transmission.
      </t>
    </abstract>

  </front>

  <middle>

    <section>
      <name>Introduction</name>
      <t>The rapid development of Artificial Intelligence (AI), particularly large language
        models (LLMs), necessitates substantial computing power. Leasing computing resources from
        third-party AI Data Centers (AIDCs) provides a cost-efficient and elastic solution for
        entities, such as industry enterprises or research institutions, that find building and
        maintaining their own DCs costly. However, AI service traffic, characterized by massive
        volume, high burstiness, and sensitivity to packet loss and latency, poses significant
        challenges to the IP WANs interconnecting multiple entities and AIDCs. Moreover, entities
        with strict security requirements may prefer to keep their datasets on-premises, which
        introduces challenges in remote access and distributed coordination.
      </t>
      <t>This document categorizes AI service scenarios over the WAN and analyzes representative use
        cases, including sample data migration, remote data access, coordinated model training
        between entities and AIDCs, coordinated model training across AIDCs, and coordinated model
        inference. Based on these use cases, this document summarizes the corresponding challenges
        and requirements, and discusses how the FANTEL architecture can address them.
      </t>
    </section>

    <section>
      <name>Conventions Used in This Document</name>
      <section>
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD
          NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
          interpreted as described in BCP 14 <xref target="RFC2119" />
          <xref target="RFC8174" />
          when, and only when, they appear in all capitals, as shown here.</t>
      </section>

      <section>
        <name>Abbreviations</name>
        <ul spacing="normal">
          <li>AI: Artificial Intelligence</li>
          <li>AIDC: AI Data Center</li>
          <li>ECMP: Equal-Cost Multipath</li>
          <li>FANTEL: Fast Notification for Traffic Engineering and Load Balancing</li>
          <li>INT: In-band Network Telemetry</li>
          <li>IOAM: In-situ Operations, Administration, and Maintenance</li>
          <li>LLM: Large Language Model</li>
          <li>MAN: Metropolitan Area Network</li>
          <li>RDMA: Remote Direct Memory Access</li>
          <li>TE: Traffic Engineering</li>
          <li>TPOT: Time per Output Token</li>
          <li>TTFT: Time to First Token</li>
          <li>WAN: Wide Area Network</li>
        </ul>
      </section>
    </section>

    <section>
      <name>Use Cases</name>
      <section>
        <name>AI Service Scenarios in WAN</name>
        <t>With the rapid growth of AI service traffic, entities and AIDCs face increasing strain on
          their computing resources. To address this, the WAN provides the essential foundation for
          integrating and delivering computing power across sites. Built on the WAN, on-demand
          compute leasing offers an elastic, cost-effective way to scale AI resources.</t>
        <t>AI service scenarios in the WAN can be categorized into three types: sample data
          transfer, coordinated training, and coordinated inference, as shown in Figure 1.</t>
        <figure>
          <name>AI service scenarios in WAN</name>
          <artwork align="center"><![CDATA[
                   S2.1: Coordinated model training
                   between multiple AIDCs
       +--------+ <------------------> +--------+
       |  AIDCs |----------------------|  AIDCs |
       +--------+                      +--------+
            ^ |                           | ^
            | |                           | |   S2.2: Coordinated
            | |                           | |   model training
S1: Sample  | |                           | |   between
      Data  | |     Wide Area Network     | |   entity and AIDCs
  Transfer  | |                           | |
            | |                           | |   S3: Coordinated model
            | |                           | |   inference
            v |                           | v
       +------------+              +------------+
       |   Entity   |              |   Entity   |
       +------------+              +------------+
	        ]]>
          </artwork>
        </figure>
        <section>
          <name>Scenario 1: Sample Data Transfer</name>
          <t>Sample data transfer refers to the transfer of massive sample data from entities’ storage DCs to their own or third-party AIDCs for model training. Because data security requirements vary among entities, there are two sub-scenarios: </t>
          <t>1) Sample Data Migration: To meet the high-throughput and low-latency requirements of AI training, entities generally migrate their sample datasets to AIDC storages, where each training round can access terabytes to petabytes of data efficiently.</t>
          <t>2) Remote Sample Data Access: To protect sensitive data, entities with strict security requirements retain their datasets in on-premises storage rather than migrating them to AIDCs. This creates the need for secure, low-latency, lossless approaches that allow timely remote access to sample data during model training.</t>
        </section>
        <section>
          <name>Scenario 2: Coordinated Model Training</name>
          <t>The computing power required for AI model training is enormous and grows rapidly, especially as models scale in size and complexity. For example, it is estimated that GPT-6 will require ZFLOPS-scale computing power, roughly a 2000x increase over GPT-4. A single DC would struggle to meet such enormous demands. Therefore, integrating dispersed computing resources to support LLM training (including training and fine-tuning of foundational models) has become a key solution.</t>
          <t>1) Coordinated Model Training across AIDCs: The scalability of a single AIDC is inherently constrained by its physical infrastructure (e.g., space and power supply). In order to meet the enormous computing power requirements of training, one solution is to coordinate distributed computing resources across multiple AIDCs. It also helps fully utilize the idle computing resources available in different DCs.</t>
          <t>2) Coordinated Model Training between Entities and AIDCs: Some entities that have deployed AI facilities in their own DCs face rapidly growing computing resource demands. Instead of building new AI infrastructure, which is costly, they can coordinate with leased third-party AIDCs to supplement their capacity. Moreover, entities with strict security requirements can further enhance this approach by incorporating split learning, with the input/output layers deployed locally to ensure sensitive data remains on-premises.</t>
        </section>
        <section>
          <name>Scenario 3: Coordinated Model Inference</name>
          <t>Entities that have deployed AI facilities in their own DCs also face degradation of TTFT and TPOT as inference concurrency continuously increases. Since expanding on-premises AI capacity is often costly, leasing third-party AIDCs and coordinating inference between the entities’ DCs and third-party AIDCs can be a cost-effective way to scale concurrent inference capacity.</t>
        </section>
      </section>
      <section>
        <name>Sample Data Migration</name>
        <section>
          <name>Use Case Description</name>
          <t>Since AI training requires multiple rounds of fine-tuning for performance improvement, and each round consumes massive sample data (ranging from terabytes to petabytes), entities need to upload these sample data into AIDCs as soon as possible to start the next round of training.</t>
          <t>Currently, many entities still rely on shipping physical hard drives to migrate such large datasets, which not only risks data loss if the drives are damaged but is also highly inefficient.</t>
          <t>For network-based solutions, entities typically have to rent dedicated line services with fixed bandwidth on a monthly or annual subscription basis, which is less cost-effective because the data transfer traffic is bursty: the bandwidth is fully utilized only during short transfer periods and remains idle the rest of the time.</t>
        </section>
        <section>
          <name>Fast Notification Impact</name>
          <t>For efficient and cost-effective sample data migration, fast notification of network status enhances the WAN in the following ways:</t>
          <t>1) To maximize the available bandwidth for data transmission, the WAN should support fast notification of network status changes across devices, achieving efficient hourly transmission of terabyte-scale sample data by enabling real-time, task-based service bandwidth adjustment. Moreover, data migration services should be provisioned per task, eliminating the need for entities to lease high-bandwidth lines on a monthly or yearly basis and thereby significantly reducing costs.</t>
          <t>2) During high-speed sample data migration, even minor network failures can trigger significant packet loss, sharply reducing efficiency. To prevent this, the WAN should provide millisecond-level fast failure notification, enabling rapid failure detection and failover to maintain high throughput.</t>
        </section>
        <section>
          <name>Example</name>
          <figure>
            <name>Sample Data Migration from Local Storage to AIDC</name>
            <artwork align="center"><![CDATA[
                   +----------+                                  
                   |Controller|                                  
                   +--+----^--+                                  
                      |    |                                     
           On-demand  |    |  Network                            
           Bandwidth  |    |  Status                             
                      |    |                                     
                      |    |                                     
                      |    |          +-------------------------+
   +---------+     +--v----+--+       |+-------+     +---------+|
   | Local   |     |   WAN    |       ||Storage+-----+Computing||
   | Storage +-----|          |-------++-------+     +---------+|
   +---------+     +----------+       |          AIDC           |
Local Sample Data                     +-------------------------+
(TB-PB Scale)                                                    
              ----------------------->                           
                   10 TB/day                                      
            ]]>
            </artwork>
          </figure>
          <t>An example of sample data migration is an enterprise that leases AI services from a third-party AIDC. In this case, the enterprise collects and stores the sample data in its local storage and needs to transfer 10 TB of data to the AIDC every day. The transfer would take several days by shipping hard drives, or about 10 days over a 100 Mbps link. </t>
          <t>Using real-time network status obtained from the fast notification mechanism, the controller can flexibly adjust service bandwidth on-demand in seconds and ensure high throughput via traffic engineering and load balancing. In this example, the controller adjusts the service bandwidth to 10 Gbps for 3 hours, which is sufficient to complete the 10 TB data migration, and dynamically updates the traffic engineering and load balancing policies to maintain high throughput.</t>
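          <t>As an illustrative sanity check of the figures in this example (not part of the FANTEL mechanism itself), the transfer times follow directly from the data volume and link rate, assuming an ideal link with no protocol overhead:</t>
          <sourcecode type="python"><![CDATA[
```python
# Back-of-the-envelope check of the 10 TB example above, assuming an
# ideal link with no protocol overhead (an idealization, not a claim
# made by this document).

def transfer_seconds(data_bytes: float, link_bps: float) -> float:
    """Time to move data_bytes over an ideal link of link_bps bits/s."""
    return data_bytes * 8 / link_bps

TB = 10 ** 12
slow = transfer_seconds(10 * TB, 100e6)  # leased 100 Mbps line
fast = transfer_seconds(10 * TB, 10e9)   # on-demand 10 Gbps service

print(f"100 Mbps: {slow / 86400:.1f} days")  # ~9.3 days per daily batch
print(f"10 Gbps:  {fast / 3600:.1f} hours")  # ~2.2 hours, within 3 hours
```
]]></sourcecode>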
        </section>
      </section>
      <section>
        <name>Remote Sample Data Access</name>
        <section>
          <name>Use Case Description</name>
          <t>Some industries with highly sensitive data prefer to keep their datasets on-premises, avoiding the risk of leakage that may arise when migrating data to third-party AIDCs. </t>
          <t>The most common method for accessing such data during AI computing is to use RDMA protocols (e.g., InfiniBand and RoCE), which achieve ultra-low latency, bypass TCP's congestion control mechanism, and rely on the Go-Back-N mechanism to handle packet loss and reordering. The Go-Back-N mechanism retransmits all unacknowledged packets (including correctly received ones) after a timeout, making RDMA extremely sensitive to packet loss: even a 0.1% packet loss can cause throughput to drop by roughly 50%. </t>
          <t>To provide efficient AI services for these industries, robust congestion-control solutions are needed in WANs to minimize latency and packet loss.</t>
        </section>
        <section>
          <name>Fast Notification Impact</name>
          <t>Millisecond-level, lightweight notification can be sent to nodes adjacent to or affected by failure or congestion, enabling lossless transmission of sample data and meeting the stringent packet-loss tolerance requirement of RDMA transmission.</t>
        </section>
        <section>
          <name>Example</name>
          <figure>
            <name>Remote Sample Data Transmission via RDMA</name>
            <artwork align="center"><![CDATA[
                                                                   
                                      +---------------------------+
    +---------+    +-------------+    | +--------+    +---------+ |
    | Local   +----+     WAN     +----|-+ Sample +----+Parameter| |
    | Storage |    |             |    | | Plane  |    |Plane    | |
    +---------+    +-------------+    | +--------+    +---------+ |
Local Sample Data                     |          AIDC             |
                                      +---------------------------+
                                                                   
         <------------------------------------------------>        
                               RDMA
            ]]>
            </artwork>
          </figure>
          <t>An example of remote data access involves an enterprise with strict security requirements that prohibit storing sensitive data outside its premises, while still wishing to lease computing resources from third-party AIDCs. In this case, remote sample data are transmitted between the AIDC and the enterprise’s local storage via RDMA, which is highly sensitive to packet loss. The distance between the AIDC and the local storage may range from 100 to 500 km. </t>          
          <t>The fast notification mechanism enables flow-based precise congestion control through immediate congestion notification, ensuring lossless RDMA transmission and thereby supporting secure and efficient model training.</t>
        </section>
      </section>
      <section>
        <name>Coordinated Model Training across AIDCs</name>
        <section>
          <name>Use Case Description</name>
          <t>Due to the limited computing resources of a single DC, the training task can be split and coordinated across multiple AIDCs. This approach also helps fully utilize the idle computing resources available in different DCs. Coordinated model training across AIDCs requires the WAN to support massive, highly concurrent and bursty traffic of parameter synchronizations. </t>
        </section>
        <section>
          <name>Fast Notification Impact</name>
          <t>For coordinated model training across AIDCs, fast failure and congestion notification enhances WAN in the following ways:</t>
          <t>1) Dynamic load balancing: Fast network status notifications allow dynamic load balancing strategies to be deployed in real time, ensuring optimal utilization of network resources and maintaining high performance.</t>
          <t>2) Low-latency, lossless parameter synchronization: Fast notification enables millisecond-level congestion control in the WAN, allowing upstream devices to promptly reduce their transmission rates upon detecting impending congestion.</t>
          <t>3) Rapid failure protection: Interruptions in parameter synchronization due to network failures can trigger rollback and computation waste, sharply reducing training efficiency <xref target="draft-cheng-rtgwg-ai-network-reliability-problem" />. Fast notification enables millisecond-level failure detection and failover, minimizing training disruptions.</t>
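          <t>The value of millisecond-level detection can be illustrated with a simple loss estimate. The 400 Gbps rate below is an assumed inter-AIDC link speed, not a figure from this document:</t>
          <sourcecode type="python"><![CDATA[
```python
# Illustrative only: traffic blackholed while a failure goes undetected.
# The 400 Gbps link rate is an assumed inter-AIDC figure.

def bytes_dropped(link_bps: float, detection_seconds: float) -> float:
    """Bytes sent into a failed link before detection completes."""
    return link_bps / 8 * detection_seconds

RATE = 400e9
print(bytes_dropped(RATE, 1.0) / 1e9)    # ~50 GB with 1 s detection
print(bytes_dropped(RATE, 0.001) / 1e6)  # ~50 MB with 1 ms notification
```
]]></sourcecode>
          <t>Since every dropped byte of RDMA parameter traffic can additionally trigger Go-Back-N retransmission and training rollback, shrinking the detection window directly reduces wasted computation.</t>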
        </section>
        <section>
          <name>Example</name>
          <figure>
            <name>Coordinated Model Training across AIDCs</name>
            <artwork align="center"><![CDATA[
+-------------+                       +-------------+
| +---------+ |    +-------------+    | +---------+ |
| |Parameter| +----+     WAN     +----+ |Parameter| |
| |Plane    | |    |             |    | |Plane    | |
| +---------+ |    +-------------+    | +---------+ |
|    AIDC     |                       |    AIDC     |
+-------------+          RDMA         +-------------+
    ^  <------------------------------------->   ^   
    |         Parameter Synchronization          |   
    |                                            |   
    |                                            |   
    |        Parallelization Strategies          |   
    +---------------------+----------------------+   
                          |                          
                 +--------+--------+                  
                 |LLM Training Task|                  
                 +-----------------+                  
            ]]>
            </artwork>
          </figure>            
          <t>An example of coordinated model training across AIDCs is splitting the training task of an LLM using parallelization strategies such as pipeline parallelism and data parallelism. During model training, LLM parameters are synchronized across geographically distributed AIDCs over the WAN using the RDMA protocol. The parameter synchronization traffic consists of highly concurrent, bursty elephant flows, characterized by long duration and large data volume, which can easily cause network congestion.</t>
          <t>Leveraging fast network status notification, the WAN can perform efficient traffic engineering and load balancing. In addition, fast failure and congestion notifications enable flow-based precise congestion control, ensuring lossless and efficient parameter synchronization.</t>            
        </section>
      </section>
      <section>
        <name>Coordinated Model Training between Entities and AIDCs</name>
        <section>
          <name>Use Case Description</name>
          <t>Considering cost and security, some entities may choose to lease third-party AIDCs to meet their rapidly growing computing resource demands. In this case, split learning can be applied, where only the input/output layers are deployed locally for data security, while the intermediate layers are deployed in third-party AIDCs for cost efficiency. Only activations and gradients are transmitted over the WAN. However, the transmission between entities and AIDCs still requires low latency and even more elastic bandwidth.</t>
        </section>
        <section>
          <name>Fast Notification Impact</name>
          <t>For coordinated model training between entities and AIDCs, the fast notification mechanism enhances WAN performance in the following ways:</t>
          <t>1) Providing low latency through fast congestion and failure notification (as discussed in Section 3.4.2).</t>
          <t>2) Enabling elastic bandwidth allocation and dynamic traffic engineering and load balancing strategies through fast network status notification (as discussed in Section 3.2.2).</t>
        </section>
        <section>
          <name>Example</name>
          <figure>
            <name>Coordinated Model Training between entities and AIDCs</name>
            <artwork align="center"><![CDATA[
+------------------------+                       +-------------+         
| +-------+  +---------+ |    +-------------+    | +---------+ |         
| |Local  +--+Parameter| +----+     WAN     +----+ +Parameter| |         
| |Storage|  |Plane    | |    |             |    | |Plane    | |         
| +-------+  +---------+ |    +-------------+    | +---------+ |         
|        Entity          |                       |    AIDC     |         
+--------------^---------+          RDMA         +--------^----+         
     +-------+ |  <-------------------------------------> | +---------+
     |+-+ +-+| |          (Activations,Gradients)         | |+-+   +-+|
     ||1| |n|| |                                          | ||2|   |n||
     || | | || |                                          | || |...|-||
     || | | || |              Split Learning              | || |   |1||
     |+-+ +-+| +---------------------+--------------------+ |+-+   +-+|
     +-------+                       |                      +---------+
                             +-------+-------+                           
                             | Training Task |                           
                             +---------------+                                         
            ]]>
            </artwork>
          </figure>          
          <t>An example of coordinated model training between entities and AIDCs is that an entity only needs to build a minimal amount of computing resources to deploy the input/output layers locally, while leasing third-party AIDC resources to deploy the intermediate layers. During model training, the activations of the forward pass and the gradients of the backward pass are transmitted over the WAN using the RDMA protocol; this traffic is bursty and latency-sensitive.</t>
          <t>Fast network status notification enables flexible bandwidth allocation, dynamic traffic engineering and load balancing, while fast congestion notification ensures low latency.</t>          
        </section>
      </section>
      <section>
        <name>Coordinated Model Inference</name>
        <section>
          <name>Use Case Description</name>
          <t>Similar to Section 3.5, entities may also choose to lease third-party AIDCs for model inference. Some may depend entirely on third-party AIDCs for inference, which requires sufficient bandwidth and low latency for access via the WAN, while others lease third-party AIDCs as a supplement coordinating with the entities’ local inference, which requires low-latency, low-loss, and cost-effective transmission.</t>
        </section>
        <section>
          <name>Fast Notification Impact</name>
          <t>For coordinated model inference, the fast notification mechanism enhances WAN performance in the following ways:</t>
          <t>1) Providing low latency through fast congestion and failure notification (as discussed in Section 3.4.2).</t>
          <t>2) Enabling elastic bandwidth allocation and dynamic traffic engineering and load balancing strategies through fast network status notification (as discussed in Section 3.2.2).</t>
        </section>
        <section>
          <name>Example</name>
          <figure>
            <name>Coordinated Model Inference</name>
            <artwork align="center"><![CDATA[
+------------------------+                       +-------------+             
| +-------+  +---------+ |    +-------------+    | +---------+ |             
| |Local  +--+Parameter| +----+     WAN     +----+ +Parameter| |             
| |Storage|  |Plane    | |    |             |    | |Plane    | |             
| +-------+  +---------+ |    +-------------+    | +---------+ |             
|        Entity          |                       |    AIDC     |             
+--------------^---------+          RDMA         +----------^--+             
  +---------+  | <------------------------------------->   | +---------+    
  |+-+   +-+|  |         (KV Cache, Activations)           | |+-+   +-+|    
  ||1|   |n||  |                                           | ||1|   |n||    
  || |...| ||  |                                           | || |...| ||    
  || |   | ||  |             Split Learning                | || |   | ||    
  |+-+   +-+|  +--------------------+----------------------+ |+-+   +-+|    
  +---------+                       |                        +---------+    
   Prefill instance          +-------+-------+           Decode Instance
                             |Inference Task |                               
                             +---------------+                                
              ]]>
            </artwork>
          </figure>
          <t>An example of coordinated model inference employs split learning to distribute the inference task between entities and AIDCs. The prefill instance is deployed locally and the decode instance is deployed in third-party AIDCs, since the decode phase has significantly higher GPU memory requirements compared to the prefill phase. This reduces the demand on the entity's local DC and keeps the prompts on-premises. Additionally, the input and output layers of the decode phase can remain in the entity’s DC to meet stricter data security requirements. During model inference, the key/value cache and intermediate activations are transmitted via the WAN using the RDMA protocol.</t>          
          <t>Similar to Section 3.5, fast network status notification enables flexible bandwidth allocation, dynamic traffic engineering and load balancing, while fast congestion and failure notification ensures low latency and effective congestion control.</t>
        </section>
      </section>
    </section>
    <section>
      <name>Challenges and Requirements</name>
      <t>The above use cases introduce elastic, cost-effective, and secure ways to scale and fully utilize computing resources for AI services across multiple sites. However, these approaches involve long-distance transmission of AI service traffic over the WAN, which is characterized by massive volume, high burstiness, and sensitivity to packet loss and latency. These characteristics expose limitations in existing mechanisms, including delayed decision-making, coarse-grained feedback, and slow recovery: </t>
      <ul spacing="normal">
        <li>Load Balancing mechanisms driven by telemetry (e.g., IOAM) typically depend on centralized control or static policies, resulting in delayed reactions to highly dynamic traffic, which may lead to congestion or packet loss.</li>
        <li>Flow Control mechanisms (e.g., ECN <xref target="RFC3168" />) often rely on end-to-end feedback that is constrained by round-trip time (RTT) delays, making fine-grained, real-time adjustment difficult.</li>
        <li>Failure Protection mechanisms (e.g., BFD <xref target="RFC5880" />, FRR <xref target="RFC7490" />) typically involve periodic detection and precomputed backup paths, which cannot always provide millisecond-level recovery in complex multi-domain environments. Moreover, increasing the probe frequency to shorten detection time inevitably raises CPU and bandwidth overhead.</li>
      </ul>
      <t>To address these challenges, the FANTEL architecture needs to provide fast, real-time, lightweight notifications for efficient load balancing, flow control, and failure protection, enabling elastic bandwidth, lossless transmission, and fast failover recovery for AI service traffic over the WAN. Specifically:</t>
      <ul spacing="normal">
        <li>Fast Network Status Notification delivers real-time visibility into traffic patterns, link utilization, and node load to support timely adjustments of paths and traffic rates.</li>
        <li>Fast Congestion Notification provides low-latency, fine-grained feedback to enable immediate adjustment of the data transmission rate or re-routing, preventing congestion and packet loss.</li>
        <li>Fast Failure Notification reports link or node failures with real-time detection and precise propagation, allowing immediate responses such as switching to backup paths, rerouting traffic, or suppressing affected routes, to ensure service reliability.</li>
      </ul>
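      <t>The notification-driven reactions above can be sketched, purely as a non-normative illustration, by a sender-side handler. The message fields, the queue-utilization threshold, and the rate-halving policy in the sketch are illustrative assumptions and are not defined by this document.</t>
      <figure>
        <name>Illustrative handling of fast notifications (non-normative)</name>
        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass

@dataclass
class FastNotification:
    # Hypothetical message; field names are illustrative only.
    kind: str          # "status" | "congestion" | "failure"
    link_id: str       # identifier of the reported link
    queue_util: float  # fraction of the egress queue in use, 0.0-1.0

class Sender:
    """Reacts to notifications locally, without waiting a full RTT."""

    def __init__(self, rate_gbps: float, links: set[str]):
        self.rate_gbps = rate_gbps
        self.active_links = set(links)

    def on_notify(self, n: FastNotification) -> None:
        if n.kind == "congestion" and n.queue_util > 0.8:
            # Fast congestion notification: cut the sending rate at once.
            self.rate_gbps *= 0.5
        elif n.kind == "failure":
            # Fast failure notification: stop using the failed link.
            self.active_links.discard(n.link_id)
]]></sourcecode>
      </figure>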
    </section>
    <section anchor="IANA">
      <!-- All drafts are required to have an IANA considerations section. See RFC 8126 for a
      guide.-->
      <name>IANA Considerations</name>
      <t>This document has no IANA actions.</t>
    </section>

    <section anchor="Security">
      <!-- All drafts are required to have a security considerations section. See RFC 3552 for a
      guide. -->
      <name>Security Considerations</name>
      <t>TBD.</t>
    </section>

    <!-- NOTE: The Acknowledgements and Contributors sections are at the end of this template -->
  </middle>

  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>

        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml" />
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3168.xml" />
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5880.xml" />
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7490.xml" />
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml" />

        <!-- <reference anchor="draft-geng-fantel-fantel-requirements">
          <front>
            <title>Requirements of Fast Notification for Traffic Engineering and Load Balancing</title>
            <author>
            </author>
          </front>
        </reference>

        <reference anchor="draft-geng-fantel-fantel-gap-analysis">
          <front>
            <title>Gap Analysis of Fast Notification for Traffic Engineering and Load Balancing</title>
            <author>
            </author>
          </front>
        </reference> -->

        <reference anchor="draft-cheng-rtgwg-ai-network-reliability-problem">
          <!-- [CHECK] this title duplicates the commented-out gap-analysis reference above; verify
          it matches the anchored draft, and add author information -->
          <front>
            <title>Gap Analysis of Fast Notification for Traffic Engineering and Load Balancing</title>
            <author>
            </author>
          </front>
        </reference>


      </references>
    </references>

    <section anchor="Contributors" numbered="false">
      <!-- [REPLACE/DELETE] a Contributors section is optional -->
      <name>Contributors</name>
      <t>Thanks to all the contributors.</t>
      <!-- [CHECK] it is optional to add a <contact> record for some or all contributors -->
    </section>

  </back>
</rfc>