<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<!-- You want a table of contents -->
<!-- Use symbolic labels for references -->
<!-- This sorts the references -->
<!-- Change to "yes" if someone has disclosed IPR for the draft -->
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate -->
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-liu-nmrg-ai-llm-inference-requirements-00"
     ipr="trust200902">
  <front>
    <title abbrev="Network Management">Requirements Analysis of System and
    Network for Large Language Model Inference Service</title>

    <author fullname="Chang Liu" initials="C." surname="Liu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>liuchangjc@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Chuyi Guo" initials="C." surname="Guo">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>guochuyi@chinamobile.com</email>
      </address>
    </author>


    <date day="3" month="March" year="2025"/>

    <area>IRTF</area>

    <workgroup>Network Management Research Group</workgroup>

    <keyword>LLM inference</keyword>

    <keyword>PD Fusion</keyword>

    <keyword>PD Disaggregation</keyword>

    <keyword>KV Cache</keyword>

    <abstract>
      <t>With the rise of ChatGPT, DeepSeek, and other Large Language Models
      (LLMs), together with the proliferation of inference applications,
      inference serving for large-scale user populations has become
      increasingly critical. However, because inference places extreme demands
      on computing power and communication, the large-scale service deployment
      of LLMs poses significant challenges. To address these challenges,
      vendors have adopted diverse inference service architectures, among
      which vLLM, proposed in 2023, is the most representative. This document
      surveys mainstream inference frameworks, summarizes their core design
      principles, and analyzes the requirements and challenges they impose on
      system and network configurations. The goal is to lay a foundation for
      defining a unified LLM inference architecture in the future.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>Since the launch of ChatGPT in late 2022, more and more
      production-level LLMs have emerged, with GPT-4o, Claude 3.5 Sonnet,
      Gemini, Kimi, and others leading the charge. In early 2025, DeepSeek-R1
      reignited the LLM frenzy, and xAI recently unveiled the powerful Grok 3.
      It is evident that LLMs will continue to reach new heights.</t>

      <t>Major vendors, including OpenAI, Anthropic, DeepSeek, and Google,
      have deployed their LLM applications across mobile and web platforms. As
      the field grows, daily active users (DAUs) of these applications are
      expected to surge, potentially reaching hundreds of millions during peak
      periods. This presents significant challenges for large-scale inference
      services. For instance, as of this writing, DeepSeek still struggles
      with persistent "Service Busy" issues.</t>

      <t>Existing large-scale inference service architectures primarily adopt
      two technical approaches: Prefill-Decoding (PD) Fusion and
      Prefill-Decoding (PD) Disaggregation. The split derives from the
      distinct computational characteristics of the Prefill
      (compute-intensive) and Decoding (memory-intensive) phases. Efficient
      network management and hardware coordination are essential to maximize
      system throughput and minimize user-perceived latency.</t>

      <t>This document first introduces mainstream inference frameworks, then
      optimization metrics, and finally elaborates on the network and system
      requirements for deploying large-scale LLM inference services.</t>
    </section>

    <section title="Service-Oriented Inference Frameworks">
      <t>Mainstream LLM serving systems currently follow two main technical
      routes: PD Fusion and PD Disaggregation. Prefill processes all tokens of
      a user request (the prompt) in parallel; it is compute-intensive and
      compute-bound, with extremely high computing power requirements.
      Decoding generates the requested content based on the KV Cache and the
      first token produced by the Prefill phase. Because each step reuses the
      KV Cache of all tokens preceding the current one, Decoding is
      memory-intensive and memory-bound, placing higher demands on memory. A
      complete LLM inference procedure is shown in Figure 1. Depending on
      whether these two stages, with their clearly different computing
      requirements, are decoupled, the two technical routes emerge: PD Fusion
      and PD Disaggregation. The rest of this section describes the two
      architectures in detail.</t>

      <figure align="center" title="LLM Inference Process">
        <artwork align="center" type="ascii-art"> +-------------+  +-------------+ +-------------+ +-------------+      
 |     LLM     |  |     LLM     | |     LLM     | |     LLM     |      
 | Iteration 1 +-+| Iteration 2 ++| Iteration 3 ++| Iteration 4 ++     
 +-----^-------+ |+---^------^--+|+---^-------^-+|+---^-------^-+|     
       |         |    |      |   |    |       |  |    |       |  |     
       |         | +--+--+   |   | +--+--+    |  | +--+--+    |  |     
&lt;Prompt:Is apple | | KV  | &lt;Yes&gt; | | KV  |  &lt;It&gt; | | KV  |  &lt;Is&gt; |&lt;EOS&gt;
        a fruit?&gt;| |Cache|   ^   | |Cache|    ^  | |Cache|    ^  |  ^  
                 | +--^--+   |   | +--^--+    |  | +--^--+    |  |  |  
                 |    |      |   |    |       |  |    |       |  |  |  
                 +----+------+   +----+-------+  +----+-------+  +--+  
                                                                       
+-----Prefill----+--------------------Decoding----------------------+  </artwork>
      </figure>

      <t>Prefill: Processes all tokens in user prompts (Parallelizable,
      compute-bound, requiring high computing power).</t>

      <t>Decoding: Generates output tokens sequentially based on the KV Cache
      from Prefill (Memory-bound, requiring high GPU memory).</t>

      <section title="PD Fusion Architecture">
        <t>In PD Fusion, LLM instances are deployed within a single cluster,
        managed by a global scheduler responsible for load balancing, KV Cache
        management, and resource allocation. Most frameworks adopt vLLM<xref
        target="vLLM"/>'s paged KV Cache mechanism, inspired by OS virtual
        memory management. This approach stores the KV Cache in
        non-contiguous physical blocks across nodes and uses a scheduler to
        map logical blocks to physical memory. Additionally, prefix-sharing
        strategies reuse the KV Cache for prompts with identical prefixes,
        reducing redundant computation. Remote KV Cache replication across
        nodes is also required, to avoid recomputing the KV Cache of the same
        tokens on different nodes. The architecture is shown in Figure 2.</t>

        <figure align="center" title="PD Fusion Architecture">
          <artwork align="center" type="ascii-art">                      Request1/Prompt1                        
                      Request2/Prompt2                        
                              |                               
                              |                               
                 +------------v------------+                  
                 |                         |                  
                 |  Scheduler/Controller   |                  
       Request1  |                         | Request2         
     +-----------+  *********************  +----------+       
     |           |  *KV Cache Management*  |          |       
     |           |  *  Load Balancing   *  |          |       
     |           |  *     ... ...       *  |          |       
     |           |  *********************  |          |       
     |           +-------------------------+          |       
     |                                                |       
     |                                                |       
     |                                                |       
+----v-----+  Remote    +----------+   Remote    +----v-----+ 
|  Model   |KVCache copy|  Model   | KVCache copy|  Model   | 
|Instance 1&lt;-----------&gt;|Instance 2|&lt;------------&gt;Instance 3| 
+----------+            +----------+             +----------+ </artwork>
        </figure>
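        <t>The paged mechanism can be sketched as a tiny block table, loosely
        modeled on vLLM's PagedAttention idea (a hypothetical simplification;
        eviction and cross-node replication are omitted):</t>

```python
class PagedKVCache:
    # Hypothetical sketch of vLLM-style paged KV Cache bookkeeping:
    # logical token positions are mapped onto non-contiguous physical
    # blocks through a per-request block table.
    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}   # request id -> list of physical block ids
        self.lengths = {}  # request id -> tokens cached so far

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:
            # Logical block boundary: grab ANY free physical block;
            # physical placement need not be contiguous.
            block = self.free.pop()
            self.tables.setdefault(req_id, []).append(block)
        self.lengths[req_id] = n + 1

    def physical_location(self, req_id, token_index):
        # Map a logical token index to (physical block, offset).
        table = self.tables[req_id]
        return (table[token_index // self.block_size],
                token_index % self.block_size)
```

        <t>Because a logical block may land in any free physical block, GPU
        memory need not be contiguous, and the same indirection layer is what
        makes prefix sharing and cache migration tractable.</t>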
      </section>

      <section title="PD Disaggregation Architecture">
        <t>In PD Disaggregation, Prefill and Decoding are decoupled into
        separate instances to optimize hardware utilization. After Prefill
        computes the full KV Cache for a prompt, the data is transferred to
        Decoding instances for text generation. This architecture demands
        efficient coordination between Prefill and Decoding instances, as well
        as reliable high-speed data transmission. The workflow is illustrated
        in Figure 3.</t>

        <figure align="center" title="PD Disaggregation Architecture">
          <artwork align="center" type="ascii-art">                      Request1/Prompt1                              
                      Request2/Prompt2                              
                              |                                     
                              |                                     
                 +------------v------------+                        
                 |                         |                        
                 |  Scheduler/Controller   |                        
       Request1  |                         | Request2               
     +-----------+  *********************  +----------+             
     |           |  *KV Cache Management*  |          |             
     |           |  *  Load Balancing   *  |          |             
     |           |  *     ... ...       *  |          |             
     |           |  *********************  |          |             
     |           +-------------------------+          |             
     |                                                |             
     |                                                |             
     |                                                |             
+----v-----+                                     +----v-----+       
|  Model   |                                     |  Model   |       
|          |       Remote KVCache copy           |          |       
| Prefill  &lt;-------------------------------------&gt; Prefill  |       
|Instance 1|                                     |Instance 2|       
+----+-----+                                     +----+-----+       
     |KV Cache                                KV Cache|             
     |Transfer                                Transfer|             
     |                                                |             
+----v-----+                                     +----v-----+       
|  Model   |                                     |  Model   |       
|          |                                     |          |       
|Decoding  |                                     |Decoding  |       
|Instance 1|                                     |Instance 2|       
+----------+                                     +----------+</artwork>
        </figure>
      </section>
    </section>

    <section title="System Goodput and Optimization Metrics">
      <t>The ultimate goals of an inference system are to maximize system
      goodput, which reflects the volume of user requests served, and to
      minimize user-perceived latency. For PD Disaggregation architectures in
      particular, two key metrics are defined as follows:</t>

      <t>TTFT (Time to First Token): The latency from request arrival until
      the first output token is produced, dominated by the Prefill phase.</t>

      <t>TBT (Time Between Tokens): The interval between consecutive token
      generations in the Decoding phase.</t>

      <t>Optimization aims to minimize both TTFT and TBT under resource
      constraints and SLO constraints.</t>
    </section>

    <section title="Network and System Requirements for Service-Oriented Inference Frameworks">
      <t>To achieve large-scale LLM service deployment, frameworks MUST
      address the following challenges in both control plane and data
      plane.</t>

      <section title="Efficient Load Balancing">
        <t>Both PD Fusion and PD Disaggregation architectures require dynamic
        load balancing to prevent server overload. For PD Disaggregation,
        schedulers MUST consider compute constraints (Prefill) and memory
        constraints (Decoding) when distributing requests.</t>
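        <t>One way to honor the two constraints is to route on a different
        signal per phase; a minimal sketch (the instance fields are
        assumptions, not taken from any particular framework):</t>

```python
def pick_prefill_instance(instances):
    # Prefill is compute-bound: route to the instance with the fewest
    # pending prompt tokens to process.
    return min(instances, key=lambda i: i["pending_tokens"])

def pick_decoding_instance(instances):
    # Decoding is memory-bound: route to the instance with the most
    # free KV Cache blocks.
    return max(instances, key=lambda i: i["free_kv_blocks"])
```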
      </section>

      <section title="KV Cache Management">
        <t>Effective KV Cache management is critical. Because most frameworks
        adopt vLLM&rsquo;s paged KV Cache mechanism, schedulers are REQUIRED
        to handle memory allocation, cross-request KV Cache sharing, and KV
        Cache replacement policies. Future optimizations must address
        exponential user growth and ensure efficient cache synchronization
        across clusters and nodes.</t>
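        <t>Cross-request sharing can be sketched as a reference-counted
        prefix map (a hypothetical simplification; real systems key on
        block-aligned token prefixes rather than whole prefixes):</t>

```python
class PrefixCache:
    # Requests whose prompts share a prefix reuse the same cached
    # entries via reference counting; a count of zero makes the entry
    # a candidate for the replacement policy.
    def __init__(self):
        self.blocks = {}  # prefix (tuple of tokens) -> reference count

    def acquire(self, tokens):
        key = tuple(tokens)
        hit = key in self.blocks
        self.blocks[key] = self.blocks.get(key, 0) + 1
        return hit  # True: Prefill for this prefix can be skipped

    def release(self, tokens):
        key = tuple(tokens)
        self.blocks[key] -= 1
        if self.blocks[key] == 0:
            del self.blocks[key]  # now eligible for eviction
```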
      </section>

      <section title="KV Cache Transmission">
        <t>PD Disaggregation architectures demand high-speed, reliable
        transmission of KV Cache data between Prefill and Decoding instances.
        The network MUST provide low-latency, high-bandwidth channels to
        ensure seamless coordination.</t>
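        <t>A back-of-envelope estimate shows why. For a hypothetical
        7B-class model (32 layers, 32 KV heads, head dimension 128, fp16;
        these hyperparameters are assumptions, not taken from this document),
        the per-request KV Cache already reaches 1 GiB at a 2048-token
        context:</t>

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2):
    # Factor 2 covers keys AND values; fp16 gives 2 bytes per element.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem)

size = kv_cache_bytes(32, 32, 128, 2048)  # hypothetical 7B-class model
# size == 2**30 bytes, i.e. exactly 1 GiB for a 2048-token context
```

        <t>Moving 1 GiB between a Prefill and a Decoding instance within,
        say, 10 ms requires on the order of 100 GB/s of effective bandwidth,
        which is why low-latency, high-bandwidth channels are a hard
        requirement for this architecture.</t>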
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References">
      <reference anchor="vLLM">
        <front>
          <title>Efficient Memory Management for Large Language Model Serving
          with PagedAttention</title>

          <author fullname="Woosuk Kwon" surname="Kwon">
            <organization>UC Berkeley</organization>
          </author>

          <date year="2023"/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
