<?xml version="1.0" encoding="utf-8"?>
<?xml-model href="rfc7991bis.rnc"?>  <!-- Required for schema validation and schema-aware editing -->
<!-- <?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?> -->
<!-- This third-party XSLT can be enabled for direct transformations in XML processors, including
most browsers -->
<!DOCTYPE rfc [
  <!ENTITY nbsp "&#160;">
  <!ENTITY zwsp "&#8203;">
  <!ENTITY nbhy "&#8209;">
  <!ENTITY wj "&#8288;">
]>
<!-- If further character entities are required then they should be added to the DOCTYPE above.
     Use of an external entity file is not recommended. -->

<rfc
  xmlns:xi="http://www.w3.org/2001/XInclude"
  category="info"
  docName="draft-hhz-fantel-sar-wan-01"
  ipr="trust200902"
  obsoletes=""
  updates=""
  submissionType="IETF"
  xml:lang="en"
  version="3">

  <front>
    <title abbrev="FANTEL Use Cases in WAN">FANTEL Use Cases and Requirements in Wide Area
      Networks</title>

    <seriesInfo name="Internet-Draft" value="draft-hhz-fantel-sar-wan-01" />

    <author fullname="Fan Zhang" initials="F" surname="Zhang">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangdong</region>
          <code>510000</code>
          <country>CN</country>
        </postal>
        <email>zhangf52@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Jiayuan Hu" initials="J" surname="Hu">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangdong</region>
          <code>510000</code>
          <country>CN</country>
        </postal>
        <email>hujy5@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Zehua Hu" initials="Z" surname="Hu">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangdong</region>
          <code>510000</code>
          <country>CN</country>
        </postal>
        <email>huzh2@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Yongqing Zhu" initials="Y" surname="Zhu">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangdong</region>
          <code>510000</code>
          <country>CN</country>
        </postal>
        <email>zhuyq8@chinatelecom.cn</email>
      </address>
    </author>

    <date year="2025" />

    <area>Routing</area>
    <workgroup>FANTEL</workgroup>

    <keyword>FANTEL</keyword>
    <keyword>AI</keyword>
    <keyword>WAN</keyword>
    <keyword>traffic engineering</keyword>
    <keyword>load balancing</keyword>

    <abstract>
      <t>
        This document introduces the main scenarios related to AI services in the WAN, as well as
        their requirements for FANTEL (FAst Notification for Traffic Engineering and Load balancing)
        in these scenarios. Traditional network management mechanisms are often constrained by slow
        feedback and high overhead, which limits their ability to react quickly to sudden link
        failures, congestion, or load imbalances. These AI services therefore need FANTEL to provide
        real-time, proactive notifications for traffic engineering and load balancing, meeting their
        requirements for ultra-high throughput and lossless data transmission.
      </t>
    </abstract>

  </front>

  <middle>

    <section>
      <name>Introduction</name>
      <t>The rapid development of Artificial Intelligence (AI), particularly large language
        models (LLMs), necessitates substantial computing power. Leasing computing resources from
        third-party AI Data Centers (AIDCs) provides a cost-efficient and elastic solution for
        entities, such as industry enterprises or research institutions, that find building and
        maintaining their own DCs costly. However, AI service traffic, characterized by massive
        volume, high burstiness, and sensitivity to packet loss and latency, poses significant
        challenges to the IP WANs interconnecting multiple entities and AIDCs. Moreover, entities
        with strict security requirements may prefer to keep their datasets on-premises, which
        introduces challenges in remote access and distributed coordination.
      </t>
      <t>This document categorizes AI service scenarios over the WAN and analyzes representative use
        cases, including sample data migration, remote data access, coordinated model training
        between entities and AIDCs, coordinated model training across AIDCs, and coordinated model
        inference. Based on these use cases, this document summarizes the corresponding challenges
        and requirements, and discusses how the FANTEL architecture can address them.
      </t>
    </section>

    <section>
      <name>Conventions Used in This Document</name>
      <section>
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD
          NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
          interpreted as described in BCP 14 <xref target="RFC2119" />
          <xref target="RFC8174" />
          when, and only when, they appear in all capitals, as shown here.</t>
      </section>

      <section>
        <name>Abbreviations</name>
        <ul spacing="normal">
          <li>AI: Artificial Intelligence</li>
          <li>AIDC: AI Data Center</li>
          <li>ECMP: Equal-Cost Multipath</li>
          <li>FANTEL: Fast Notification for Traffic Engineering and Load Balancing</li>
          <li>INT: In-band Network Telemetry</li>
          <li>IOAM: In-situ Operations, Administration, and Maintenance</li>
          <li>LLM: Large Language Model</li>
          <li>MAN: Metropolitan Area Network</li>
          <li>RDMA: Remote Direct Memory Access</li>
          <li>TE: Traffic Engineering</li>
          <li>TPOT: Time per Output Token</li>
          <li>TTFT: Time to First Token</li>
          <li>WAN: Wide Area Network</li>
        </ul>
      </section>
    </section>

    <section>
      <name>Use Cases</name>
      <section>
        <name>AI Service Scenarios in WAN</name>
        <t>With the rapid growth of AI service traffic, entities and AIDCs face increasing strain on
          their computing resources. To address this, the WAN provides the essential foundation for
          integrating and delivering computing power across sites. Built on the WAN, on-demand
          compute leasing offers an elastic, cost-effective way to scale AI resources.</t>
        <t>AI service scenarios in the WAN can be categorized into three types: sample data
          transfer, coordinated training, and coordinated inference, as shown in Figure 1.</t>
        <figure>
          <name>AI service scenarios in WAN</name>
          <artwork align="center"><![CDATA[
                   S2.1: Coordinated model training
                   between multiple AIDCs
       +--------+ <------------------> +--------+
       |  AIDCs |----------------------|  AIDCs |
       +--------+                      +--------+
            ^ |                           | ^
            | |                           | |   S2.2: Coordinated
            | |                           | |   model training
S1: Sample  | |                           | |   between
      Data  | |     Wide Area Network     | |   entity and AIDCs
  Transfer  | |                           | |
            | |                           | |   S3: Coordinated model
            | |                           | |   inference
            v |                           | v
       +------------+              +------------+
       |   Entity   |              |   Entity   |
       +------------+              +------------+
	        ]]>
          </artwork>
        </figure>
        <section>
          <name>Scenario 1: Sample Data Transfer</name>
          <t>Sample data transfer refers to the transfer of massive sample data from entities’ storage DCs to their own or third-party AIDCs for model training. Because data security requirements vary among entities, there are two sub-scenarios: </t>
          <t>1) Sample Data Migration: To meet the high-throughput and low-latency requirements of AI training, entities generally migrate their sample datasets to AIDC storages, where each training round can access terabytes to petabytes of data efficiently.</t>
          <t>2) Remote Sample Data Access: To protect sensitive data, entities with strict security requirements retain their datasets in on-premises storage rather than migrating them to AIDCs. This creates the need for secure, low-latency, lossless approaches that allow timely remote access to sample data during model training.</t>
        </section>
        <section>
          <name>Scenario 2: Coordinated Model Training</name>
          <t>The computing power required for AI model training is enormous and grows rapidly, especially as models scale in size and complexity. For example, it is estimated that GPT-6 will require ZFLOPS-scale computing power, roughly a 2000x increase over GPT-4. A single DC would struggle to meet such enormous demands. Therefore, integrating dispersed computing resources to support LLM training (including training and fine-tuning of foundational models) has become a key solution.</t>
          <t>1) Coordinated Model Training across AIDCs: The scalability of a single AIDC is inherently constrained by its physical infrastructure (e.g., space and power supply). In order to meet the enormous computing power requirements of training, one solution is to coordinate distributed computing resources across multiple AIDCs. It also helps fully utilize the idle computing resources available in different DCs.</t>
          <t>2) Coordinated Model Training between Entities and AIDCs: Some entities that have deployed AI facilities in their own DCs face rapidly growing computing resource demands. Instead of building new AI infrastructure, which is costly, they can coordinate with leased third-party AIDCs to supplement their capacity. Moreover, entities with strict security requirements can further enhance this approach by incorporating split learning, with the input/output layers deployed locally to ensure sensitive data remains on-premises.</t>
        </section>
        <section>
          <name>Scenario 3: Coordinated Model Inference</name>
          <t>Entities that have deployed AI facilities in their own DCs also face degradation of TTFT and TPOT as inference concurrency continuously increases. Since expanding on-premises AI capacity is often costly, leasing third-party AIDCs and coordinating inference between the entities’ DCs and third-party AIDCs can be a cost-effective way to scale concurrent inference capacity.</t>
        </section>
      </section>
      <section>
        <name>Sample Data Migration</name>
        <section>
          <name>Use Case Description</name>
          <t>Since AI training requires multiple rounds of fine-tuning for performance improvement, and each round consumes massive sample data (ranging from terabytes to petabytes), entities need to upload these sample data into AIDCs as soon as possible to start the next round of training.</t>
          <t>Currently, many entities still rely on shipping physical hard drives to migrate such large datasets, which not only risks data loss if the drives are damaged but is also highly inefficient.</t>
          <t>For network-based solutions, entities typically have to rent dedicated line services with fixed bandwidth on a monthly or annual subscription basis, which is less cost-effective because the data transfer traffic is bursty: the bandwidth is fully utilized only during short transfer periods and remains idle the rest of the time.</t>
        </section>
        <section>
          <name>Fast Notification Impact</name>
          <t>For efficient and cost-effective sample data migration, fast notification of network status enhances the WAN in the following ways:</t>
          <t>1) To maximize the available bandwidth for data transmission, the WAN should support fast notification of network status changes across devices, achieving efficient hourly transmission of terabyte-scale sample data by enabling real-time, task-based service bandwidth adjustment. Moreover, data migration services should be provisioned per task, eliminating the need for entities to lease high-bandwidth lines on a monthly or yearly basis and thereby significantly reducing costs.</t>
          <t>2) During high-speed sample data migration, even minor network failures can trigger significant packet loss, sharply reducing efficiency. To prevent this, the WAN should provide millisecond-level fast failure notification, enabling rapid failure detection and failover to maintain high throughput.</t>
        </section>
        <section>
          <name>Example</name>
          <figure>
            <name>Sample Data Migration from Local Storage to AIDC</name>
            <artwork align="center"><![CDATA[
                   +----------+                                  
                   |Controller|                                  
                   +--+----^--+                                  
                      |    |                                     
           On-demand  |    |  Network                            
           Bandwidth  |    |  Status                             
                      |    |                                     
                      |    |                                     
                      |    |          +-------------------------+
   +---------+     +--v----+--+       |+-------+     +---------+|
   | Local   |     |   WAN    |       ||Storage+-----+Computing||
   | Storage +-----|          |-------++-------+     +---------+|
   +---------+     +----------+       |          AIDC           |
Local Sample Data                     +-------------------------+
(TB-PB Scale)                                                    
              ----------------------->                           
                   10 TB/day                                      
            ]]>
            </artwork>
          </figure>
          <t>An example of sample data migration is an enterprise that leases AI services from a third-party AIDC. In this case, the enterprise collects and stores the sample data in its local storage and needs to transfer 10 TB of data to the AIDC every day. The transfer would take several days by shipping hard drives, or about 10 days over a 100 Mbps link. </t>
          <t>Using real-time network status obtained from the fast notification mechanism, the controller can flexibly adjust service bandwidth on-demand in seconds and ensure high throughput via traffic engineering and load balancing. In this example, the controller adjusts the service bandwidth to 10 Gbps for 3 hours, which is sufficient to complete the 10 TB data migration, and dynamically updates the traffic engineering and load balancing policies to maintain high throughput.</t>
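          <t>As an illustrative sanity check of the figures in this example (not part of the FANTEL mechanism itself), the transfer times follow directly from the data volume and link rate, assuming an ideal link with no protocol overhead:</t>
          <sourcecode type="python"><![CDATA[
```python
# Back-of-the-envelope check of the 10 TB example above, assuming an
# ideal link with no protocol overhead (an idealization, not a claim
# made by this document).

def transfer_seconds(data_bytes: float, link_bps: float) -> float:
    """Time to move data_bytes over an ideal link of link_bps bits/s."""
    return data_bytes * 8 / link_bps

TB = 10 ** 12
slow = transfer_seconds(10 * TB, 100e6)  # leased 100 Mbps line
fast = transfer_seconds(10 * TB, 10e9)   # on-demand 10 Gbps service

print(f"100 Mbps: {slow / 86400:.1f} days")  # ~9.3 days per daily batch
print(f"10 Gbps:  {fast / 3600:.1f} hours")  # ~2.2 hours, within 3 hours
```
]]></sourcecode>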
        </section>
      </section>
      <section>
        <name>Remote Sample Data Access</name>
        <section>
          <name>Use Case Description</name>
          <t>Some industries with highly sensitive data prefer to keep their datasets on-premises, avoiding the risk of leakage that may arise when migrating data to third-party AIDCs. </t>
          <t>The most common method for accessing such data during AI computing is to use RDMA protocols (e.g., InfiniBand and RoCE), which achieve ultra-low latency, bypass TCP's congestion control mechanism, and rely on the Go-Back-N mechanism to handle packet loss and reordering. The Go-Back-N mechanism retransmits all unacknowledged packets (including correctly received ones) after a timeout, making RDMA extremely sensitive to packet loss: even a 0.1% packet loss can cause throughput to drop by roughly 50%. </t>
          <t>To provide efficient AI services for these industries, robust congestion-control solutions are needed in WANs to minimize latency and packet loss.</t>
        </section>
        <section>
          <name>Fast Notification Impact</name>
          <t>Millisecond-level, lightweight notification can be sent to nodes adjacent to or affected by failure or congestion, enabling lossless transmission of sample data and meeting the stringent packet-loss tolerance requirement of RDMA transmission.</t>
        </section>
        <section>
          <name>Example</name>
          <figure>
            <name>Remote Sample Data Transmission via RDMA</name>
            <artwork align="center"><![CDATA[
                                                                   
                                      +---------------------------+
    +---------+    +-------------+    | +--------+    +---------+ |
    | Local   +----+     WAN     +----|-+ Sample +----+Parameter| |
    | Storage |    |             |    | | Plane  |    |Plane    | |
    +---------+    +-------------+    | +--------+    +---------+ |
Local Sample Data                     |          AIDC             |
                                      +---------------------------+
                                                                   
         <------------------------------------------------>        
                               RDMA
            ]]>
            </artwork>
          </figure>
          <t>An example of remote data access involves an enterprise with strict security requirements that prohibit storing sensitive data outside its premises, while still wishing to lease computing resources from third-party AIDCs. In this case, remote sample data are transmitted between the AIDC and the enterprise’s local storage via RDMA, which is highly sensitive to packet loss. The distance between the AIDC and the local storage may range from 100 to 500 km. </t>          
          <t>The fast notification mechanism enables flow-based precise congestion control through immediate congestion notification, ensuring lossless RDMA transmission and thereby supporting secure and efficient model training.</t>
        </section>
      </section>
      <section>
        <name>Coordinated Model Training across AIDCs</name>
        <section>
          <name>Use Case Description</name>
          <t>Due to the limited computing resources of a single DC, the training task can be split and coordinated across multiple AIDCs. This approach also helps fully utilize the idle computing resources available in different DCs. Coordinated model training across AIDCs requires the WAN to support massive, highly concurrent and bursty traffic of parameter synchronizations. </t>
        </section>
        <section>
          <name>Fast Notification Impact</name>
          <t>For coordinated model training across AIDCs, fast failure and congestion notification enhances WAN in the following ways:</t>
          <t>1) Dynamic load balancing: Fast network status notifications allow dynamic load balancing strategies to be deployed in real time, ensuring optimal utilization of network resources and maintaining high performance.</t>
          <t>2) Low-latency, lossless parameter synchronization: Fast notification enables millisecond-level congestion control in the WAN, allowing upstream devices to promptly reduce their transmission rates upon detecting impending congestion.</t>
          <t>3) Rapid failure protection: Interruptions in parameter synchronization due to network failures can trigger rollback and computation waste, sharply reducing training efficiency <xref target="draft-cheng-rtgwg-ai-network-reliability-problem" />. Fast notification enables millisecond-level failure detection and failover, minimizing training disruptions.</t>
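          <t>The value of millisecond-level detection can be illustrated with a simple loss estimate. The 400 Gbps rate below is an assumed inter-AIDC link speed, not a figure from this document:</t>
          <sourcecode type="python"><![CDATA[
```python
# Illustrative only: traffic blackholed while a failure goes undetected.
# The 400 Gbps link rate is an assumed inter-AIDC figure.

def bytes_dropped(link_bps: float, detection_seconds: float) -> float:
    """Bytes sent into a failed link before detection completes."""
    return link_bps / 8 * detection_seconds

RATE = 400e9
print(bytes_dropped(RATE, 1.0) / 1e9)    # ~50 GB with 1 s detection
print(bytes_dropped(RATE, 0.001) / 1e6)  # ~50 MB with 1 ms notification
```
]]></sourcecode>
          <t>Since every dropped byte of RDMA parameter traffic can additionally trigger Go-Back-N retransmission and training rollback, shrinking the detection window directly reduces wasted computation.</t>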
        </section>
        <section>
          <name>Example</name>
          <figure>
            <name>Coordinated Model Training across AIDCs</name>
            <artwork align="center"><![CDATA[
+-------------+                       +-------------+
| +---------+ |    +-------------+    | +---------+ |
| |Parameter| +----+     WAN     +----+ |Parameter| |
| |Plane    | |    |             |    | |Plane    | |
| +---------+ |    +-------------+    | +---------+ |
|    AIDC     |                       |    AIDC     |
+-------------+          RDMA         +-------------+
    ^  <------------------------------------->   ^   
    |         Parameter Synchronization          |   
    |                                            |   
    |                                            |   
    |        Parallelization Strategies          |   
    +---------------------+----------------------+   
                          |                          
                 +--------+--------+                  
                 |LLM Training Task|                  
                 +-----------------+                  
            ]]>
            </artwork>
          </figure>            
          <t>An example of coordinated model training across AIDCs is splitting the training task of an LLM using parallelization strategies such as pipeline parallelism and data parallelism. During model training, LLM parameters are synchronized across geographically distributed AIDCs over the WAN using the RDMA protocol. The parameter synchronization traffic consists of highly concurrent, bursty elephant flows, characterized by long duration and large data volume, which can easily cause network congestion.</t>
          <t>Leveraging fast network status notification, the WAN can perform efficient traffic engineering and load balancing. In addition, fast failure and congestion notifications enable flow-based precise congestion control, ensuring lossless and efficient parameter synchronization.</t>            
        </section>
      </section>
      <section>
        <name>Coordinated Model Training between Entities and AIDCs</name>
        <section>
          <name>Use Case Description</name>
          <t>Considering cost and security, some entities may choose to lease third-party AIDCs to meet their rapidly growing computing resource demands. In this case, split learning can be applied, where only the input/output layers are deployed locally for data security, while the intermediate layers are deployed in third-party AIDCs for cost efficiency. Only activations and gradients are transmitted over the WAN. However, the transmission between entities and AIDCs still requires low latency and even more elastic bandwidth.</t>
        </section>
        <section>
          <name>Fast Notification Impact</name>
          <t>For coordinated model training between entities and AIDCs, the fast notification mechanism enhances WAN performance in the following ways:</t>
          <t>1) Providing low latency through fast congestion and failure notification (as discussed in Section 3.4.2).</t>
          <t>2) Enabling elastic bandwidth allocation and dynamic traffic engineering and load balancing strategies through fast network status notification (as discussed in Section 3.2.2).</t>
        </section>
        <section>
          <name>Example</name>
          <figure>
            <name>Coordinated Model Training between entities and AIDCs</name>
            <artwork align="center"><![CDATA[
+------------------------+                       +-------------+         
| +-------+  +---------+ |    +-------------+    | +---------+ |         
| |Local  +--+Parameter| +----+     WAN     +----+ +Parameter| |         
| |Storage|  |Plane    | |    |             |    | |Plane    | |         
| +-------+  +---------+ |    +-------------+    | +---------+ |         
|        Entity          |                       |    AIDC     |         
+--------------^---------+          RDMA         +--------^----+         
     +-------+ |  <-------------------------------------> | +---------+
     |+-+ +-+| |          (Activations,Gradients)         | |+-+   +-+|
     ||1| |n|| |                                          | ||2|   |n||
     || | | || |                                          | || |...|-||
     || | | || |              Split Learning              | || |   |1||
     |+-+ +-+| +---------------------+--------------------+ |+-+   +-+|
     +-------+                       |                      +---------+
                             +-------+-------+                           
                             | Training Task |                           
                             +---------------+                                         
            ]]>
            </artwork>
          </figure>          
          <t>An example of coordinated model training between entities and AIDCs is that an entity only needs to build a minimal amount of computing resources to deploy the input/output layers locally, while leasing third-party AIDC resources to deploy the intermediate layers. During model training, the activations of the forward pass and the gradients of the backward pass are transmitted over the WAN using the RDMA protocol; this traffic is bursty and latency-sensitive.</t>
          <t>Fast network status notification enables flexible bandwidth allocation, dynamic traffic engineering and load balancing, while fast congestion notification ensures low latency.</t>          
        </section>
      </section>
      <section>
        <name>Coordinated Model Inference</name>
        <section>
          <name>Use Case Description</name>
          <t>Similar to Section 3.5, entities may also choose to lease third-party AIDCs for model inference. Some may depend entirely on third-party AIDCs for inference, which requires sufficient bandwidth and low latency for access via the WAN, while others lease third-party AIDCs as a supplement coordinating with the entities’ local inference, which requires low-latency, low-loss, and cost-effective transmission.</t>
        </section>
        <section>
          <name>Fast Notification Impact</name>
          <t>For coordinated model inference, the fast notification mechanism enhances WAN performance in the following ways:</t>
          <t>1) Providing low latency through fast congestion and failure notification (as discussed in Section 3.4.2).</t>
          <t>2) Enabling elastic bandwidth allocation and dynamic traffic engineering and load balancing strategies through fast network status notification (as discussed in Section 3.2.2).</t>
        </section>
        <section>
          <name>Example</name>
          <figure>
            <name>Coordinated Model Inference</name>
            <artwork align="center"><![CDATA[
+------------------------+                       +-------------+             
| +-------+  +---------+ |    +-------------+    | +---------+ |             
| |Local  +--+Parameter| +----+     WAN     +----+ +Parameter| |             
| |Storage|  |Plane    | |    |             |    | |Plane    | |             
| +-------+  +---------+ |    +-------------+    | +---------+ |             
|        Entity          |                       |    AIDC     |             
+--------------^---------+          RDMA         +----------^--+             
  +---------+  | <------------------------------------->   | +---------+    
  |+-+   +-+|  |         (KV Cache, Activations)           | |+-+   +-+|    
  ||1|   |n||  |                                           | ||1|   |n||    
  || |...| ||  |                                           | || |...| ||    
  || |   | ||  |             Split Learning                | || |   | ||    
  |+-+   +-+|  +--------------------+----------------------+ |+-+   +-+|    
  +---------+                       |                        +---------+    
   Prefill instance          +-------+-------+           Decode Instance
                             |Inference Task |                               
                             +---------------+                                
              ]]>
            </artwork>
          </figure>
          <t>An example of coordinated model inference employs split learning to distribute the inference task between entities and AIDCs. The prefill instance is deployed locally and the decode instance is deployed in third-party AIDCs, since the decode phase has significantly higher GPU memory requirements compared to the prefill phase. This reduces the demand on the entity's local DC and keeps the prompts on-premises. Additionally, the input and output layers of the decode phase can remain in the entity’s DC to meet stricter data security requirements. During model inference, the key/value cache and intermediate activations are transmitted via the WAN using the RDMA protocol.</t>          
          <t>Similar to Section 3.5, fast network status notification enables flexible bandwidth allocation, dynamic traffic engineering and load balancing, while fast congestion and failure notification ensures low latency and effective congestion control.</t>
        </section>
      </section>
    </section>
    <section>
      <name>Challenges and Requirements</name>
      <t>The above use cases introduce elastic, cost-effective, and secure ways to scale and fully utilize computing resources for AI services across multiple sites. However, these approaches involve long-distance transmission of AI service traffic over the WAN, which is characterized by massive volume, high burstiness, and sensitivity to packet loss and latency. These characteristics expose limitations in existing mechanisms, including delayed decision-making, coarse-grained feedback, and slow recovery: </t>
      <ul spacing="normal">
        <li>Load Balancing mechanisms driven by telemetry (e.g., IOAM) typically depend on centralized control or static policies, resulting in delayed reactions to highly dynamic traffic, which may lead to congestion or packet loss.</li>
        <li>Flow Control mechanisms (e.g., ECN <xref target="RFC3168" />) often rely on end-to-end feedback that is constrained by round-trip time (RTT) delays, making fine-grained, real-time adjustment difficult.</li>
        <li>Failure Protection mechanisms (e.g., BFD <xref target="RFC5880" />, FRR <xref target="RFC7490" />) typically involve periodic detection and precomputed backup paths, which cannot always provide millisecond-level recovery in complex multi-domain environments. Moreover, increasing the probe frequency to shorten detection time inevitably raises CPU and bandwidth overhead.</li>
      </ul>
      <t>To address these challenges, the FANTEL architecture needs to provide fast, real-time, lightweight notifications for efficient load balancing, flow control, and failure protection, enabling elastic bandwidth, lossless transmission, and fast failover recovery for AI service traffic over the WAN. Specifically:</t>
      <ul spacing="normal">
        <li>Fast Network Status Notification delivers real-time visibility into traffic patterns, link utilization, and node load to support timely adjustments of paths and traffic rates.</li>
        <li>Fast Congestion Notification provides low-latency, fine-grained feedback to enable immediate adjustment of the data transmission rate or re-routing, preventing congestion and packet loss.</li>
        <li>Fast Failure Notification reports link or node failures with real-time detection and precise propagation, allowing immediate responses such as switching to backup paths, rerouting traffic, or suppressing affected routes, to ensure service reliability.</li>
      </ul>
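      <t>The notification-driven reactions above can be sketched, purely as a non-normative illustration, by a sender-side handler. The message fields, the queue-utilization threshold, and the rate-halving policy in the sketch are illustrative assumptions and are not defined by this document.</t>
      <figure>
        <name>Illustrative handling of fast notifications (non-normative)</name>
        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass

@dataclass
class FastNotification:
    # Hypothetical message; field names are illustrative only.
    kind: str          # "status" | "congestion" | "failure"
    link_id: str       # identifier of the reported link
    queue_util: float  # fraction of the egress queue in use, 0.0-1.0

class Sender:
    """Reacts to notifications locally, without waiting a full RTT."""

    def __init__(self, rate_gbps: float, links: set[str]):
        self.rate_gbps = rate_gbps
        self.active_links = set(links)

    def on_notify(self, n: FastNotification) -> None:
        if n.kind == "congestion" and n.queue_util > 0.8:
            # Fast congestion notification: cut the sending rate at once.
            self.rate_gbps *= 0.5
        elif n.kind == "failure":
            # Fast failure notification: stop using the failed link.
            self.active_links.discard(n.link_id)
]]></sourcecode>
      </figure>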
    </section>
    <section anchor="IANA">
      <!-- All drafts are required to have an IANA considerations section. See RFC 8126 for a
      guide.-->
      <name>IANA Considerations</name>
      <t>This document has no IANA actions.</t>
    </section>

    <section anchor="Security">
      <!-- All drafts are required to have a security considerations section. See RFC 3552 for a
      guide. -->
      <name>Security Considerations</name>
      <t>TBD.</t>
    </section>

    <!-- NOTE: The Acknowledgements and Contributors sections are at the end of this template -->
  </middle>

  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>

        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml" />
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3168.xml" />
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5880.xml" />
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7490.xml" />
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml" />

        <!-- <reference anchor="draft-geng-fantel-fantel-requirements">
          <front>
            <title>Requirements of Fast Notification for Traffic Engineering and Load Balancing</title>
            <author>
            </author>
          </front>
        </reference>

        <reference anchor="draft-geng-fantel-fantel-gap-analysis">
          <front>
            <title>Gap Analysis of Fast Notification for Traffic Engineering and Load Balancing</title>
            <author>
            </author>
          </front>
        </reference> -->

        <reference anchor="draft-cheng-rtgwg-ai-network-reliability-problem">
          <!-- [CHECK] this title duplicates the commented-out gap-analysis reference above; verify
          it matches the anchored draft, and add author information -->
          <front>
            <title>Gap Analysis of Fast Notification for Traffic Engineering and Load Balancing</title>
            <author>
            </author>
          </front>
        </reference>


      </references>
    </references>

    <section anchor="Contributors" numbered="false">
      <!-- [REPLACE/DELETE] a Contributors section is optional -->
      <name>Contributors</name>
      <t>Thanks to all the contributors.</t>
      <!-- [CHECK] it is optional to add a <contact> record for some or all contributors -->
    </section>

  </back>
</rfc>