<?xml version="1.0" encoding="US-ASCII"?>
<!-- This is built from a template for a generic Internet Draft. Suggestions for
     improvement welcome - write to Brian Carpenter, brian.e.carpenter @ gmail.com 
     This can be converted using the Web service at http://xml.resource.org/ -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<!-- You want a table of contents -->
<!-- Use symbolic labels for references -->
<!-- This sorts the references -->
<!-- Change to "yes" if someone has disclosed IPR for the draft -->
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate -->
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-li-nmrg-dtn-data-generation-optimization-00" ipr="trust200902">
  <front>
    <title abbrev="Data Generation and Optimization for DTN">Data Generation and Optimization for
      Digital Twin Network Performance Modeling</title>

    <author fullname="Mei Li" initials="M." surname="Li">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>limeiyjy@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Cheng Zhou" initials="C." surname="Zhou">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>zhouchengyjy@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Danyang Chen" initials="D." surname="Chen">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>chendanyang@chinamobile.com</email>
      </address>
    </author>


    <date year="2023"/>

    <area>Networking</area>

    <workgroup>Internet Research Task Force</workgroup>

    <keyword>Digital Twin; Digital Twin Network; Network Performance; Data Generation; Data
      Optimization</keyword>

    <abstract>
      <t>A Digital Twin Network (DTN) can serve as a secure and cost-effective environment in which
        network operators evaluate network performance under various what-if scenarios. Recently, AI
        models, especially neural networks, have been applied to DTN performance modeling. The
        quality of a deep learning model depends mainly on two aspects: model architecture and data.
        This memo focuses on how to improve the model from the data perspective.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT",
        "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in
          <xref target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="Intro" title="Introduction">
      <t>A digital twin is a virtual instance of a physical system (twin) that is continually updated
        with the latter's performance, maintenance, and health status data throughout the physical
        system's life cycle. A Digital Twin Network (DTN) is a digital twin used in the context of
        networking <xref target="I-D.irtf-nmrg-network-digital-twin-arch"/>. A DTN can serve as a
        secure and cost-effective environment in which network operators evaluate network
        performance under various what-if scenarios. Recently, AI models, especially neural
        networks, have been applied to DTN performance modeling.</t>

      <t>The quality of an AI model depends mainly on two aspects: model architecture and data. This
        memo focuses on the data, because the quality of training data directly affects the accuracy
        and generalization ability of the model. It discusses how to design data generation and
        optimization methods for DTN performance modeling. These methods generate simulated network
        data to address the shortage of practical data and select high-quality data from various
        data sources for training.</t>
    </section>

    <section title="Acronyms &amp; Abbreviations">
      <t><list style="hanging">
          <t hangText="DTN:">Digital Twin Network</t>

          <t hangText="AI:">Artificial Intelligence</t>

          <t hangText="AIGC:">AI-Generated Content</t>

          <t hangText="ToS:">Type of Service</t>

          <t hangText="OOD:">Out-of-Distribution</t>
        </list></t>
    </section>

    <section anchor="Require" title="Requirements">
      <t>Performance modeling is vital in DTN and is involved in typical network management
        scenarios such as planning, operation, optimization, and upgrade. Recently, some studies
        have applied AI models to DTN performance modeling, such as RouteNet <xref target="RouteNet"
        /> and MimicNet <xref target="MimicNet"/>. AI is a data-driven technology whose performance
        depends heavily on data quality.</t>

      <t>Network data sources are diverse and vary in quality, so their data is difficult to use
        directly as training data for DTN performance models:</t>

      <t><list style="symbols">
          <t>Practical data from production networks: Data from production networks usually has
            high value, but its quantity, type, and accuracy are limited. Moreover, it is not
            practical to collect data under various configurations in production networks;</t>

          <t>Network simulators: Network simulators (e.g., NS-3 and OMNeT++) can be used to generate
            simulated network data, which can solve the problems of quantity, diversity, and
            accuracy to a certain extent. However, simulation is usually time-consuming. In
            addition, there are usually differences between simulated data and practical data from
            production networks, which hinders the application of trained models to production
            networks;</t>

          <t>Generative AI models: With the development of AI-Generated Content (AIGC) technology,
            generative AI models (e.g., GPT and LLaMA) can be used to generate simulated network
            data, which can solve the problems of quantity and diversity to a certain extent.
            However, the accuracy of the data generated by generative AI models is limited, and the
            data often deviates from practical data from production networks.</t>
        </list></t>

      <t>Therefore, data generation and optimization methods for DTN performance modeling are
        needed. Such methods generate simulated network data to address the shortage of practical
        data and select high-quality data from multi-source data. High-quality data is accurate,
        diverse, and consistent with practical data from production networks. Training with
        high-quality data can improve the accuracy and generalization ability of DTN performance
        models.</t>
    </section>

    <section anchor="Framework" title="Framework of Data Generation and Optimization">
      <t>The framework of data generation and optimization for DTN performance modeling is shown in
          <xref target="framework"/>, which includes two stages: the data generation stage and the
        data optimization stage.</t>

      <figure anchor="framework"
        title="Framework of Data Generation and Optimization for DTN Performance Modeling">
        <artwork>       Data generation                   Data optimization
+---------------------------+ +-------------------------------------+
|                           | |                                     |
| +---------+               | |              +---------+            |
| |         |               | | +----------+ |         |            |
| | Network |               | | | Practical| | Easy    |            |
| | topology| +-----------+ | | | data     | | samples |            |
| |         | |           | | | +-----+----+ |         |            |
| |         | | Network   | | |       |      |         | +--------+ |
| |         | | simulator | | | +-----v----+ |         | |        | |
| | Routing | |           | | | |          | | Hard    | | High   | |
| | policy  +-&gt;           +-+-+-&gt; Candidate+-&gt; samples +-&gt; quality| |
| |         | |           | | | | data     | |         | | data   | |
| |         | | Generative| | | |          | |         | |        | |
| |         | | AI model  | | | +----------+ |         | +--------+ |
| | Traffic | |           | | |              | OOD     |            |
| | matrix  | +-----------+ | |              | samples |            |
| |         | Data generator| |              | (remove)|            |
| +---------+               | |              |         |            |
|  Network                  | |              +---------+            |
|  configuration            | |             Data selection          |
|                           | |                                     |
+---------------------------+ +-------------------------------------+      </artwork>
      </figure>

      <section title="Data Generation Stage">
        <t>The data generation stage aims to generate candidate data (simulated network data) to
          address the shortage of practical data from production networks. This stage first
          generates network configurations and then imports them into data generators to generate
          the candidate data.</t>

        <t><list style="symbols">
            <t>Network configurations: Network configurations typically include network topology,
              routing policy, and traffic matrix. These configurations need to be diverse to cover
              as many scenarios as possible. Topology configurations include the number and
              structure of nodes and edges, node buffers' size and scheduling strategy, link
              capacity, etc. Routing policy determines the path of a packet from the source to the
              destination. The traffic matrix describes the traffic entering/leaving the network,
              which includes the traffic's source, destination, time and packet size distribution,
              Type of Service (ToS), etc.</t>

            <t>Data generators: Data generators can be network simulators (e.g., NS-3 and OMNeT++)
              and/or generative AI models (e.g., GPT and LLaMA). Network configurations are
              imported into data generators to generate candidate data.</t>
          </list></t>
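
        <t>As a non-normative illustration of the network configurations item above, the following
          Python sketch generates configurations of increasing size. It assumes a random connected
          topology, hop-count shortest-path routing, and uniformly random traffic demands. The
          function names, link capacities, rate bound, and the three ToS classes are hypothetical
          choices made for this example and are not defined by this framework.</t>

        <figure>
          <artwork><![CDATA[
```python
import random
from collections import deque


def make_topology(n_nodes, n_extra_links, rng):
    """Connected random graph: spanning tree plus extra links."""
    links = set()
    for node in range(1, n_nodes):
        # Link each node to an earlier one to stay connected.
        links.add((rng.randrange(node), node))
    max_links = n_nodes * (n_nodes - 1) // 2
    target = min(n_nodes - 1 + n_extra_links, max_links)
    while len(links) < target:
        a, b = sorted(rng.sample(range(n_nodes), 2))
        links.add((a, b))
    # Assumed link capacities in Mbit/s; illustrative values only.
    capacities = [10, 40, 100]
    return {link: rng.choice(capacities) for link in sorted(links)}


def shortest_path_routing(topology, n_nodes):
    """Hop-count shortest paths (BFS) for every ordered node pair."""
    neighbors = {n: [] for n in range(n_nodes)}
    for a, b in topology:
        neighbors[a].append(b)
        neighbors[b].append(a)
    routes = {}
    for src in range(n_nodes):
        parent = {src: None}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in neighbors[cur]:
                if nxt not in parent:
                    parent[nxt] = cur
                    queue.append(nxt)
        for dst in range(n_nodes):
            if dst == src:
                continue
            path, hop = [], dst
            while hop is not None:  # walk back from dst to src
                path.append(hop)
                hop = parent[hop]
            routes[(src, dst)] = path[::-1]
    return routes


def make_traffic_matrix(n_nodes, max_rate, rng):
    """Random rate (Mbit/s) and ToS class per node pair."""
    return {
        (s, d): {"rate": rng.uniform(0.1, max_rate),
                 "tos": rng.randrange(3)}
        for s in range(n_nodes) for d in range(n_nodes) if s != d
    }


def make_configuration(n_nodes, seed):
    """One configuration: topology, routing policy, traffic matrix."""
    rng = random.Random(seed)
    topology = make_topology(n_nodes, n_nodes // 2, rng)
    return {
        "topology": topology,
        "routing": shortest_path_routing(topology, n_nodes),
        "traffic": make_traffic_matrix(n_nodes, 5.0, rng),
    }


# Scale from small to larger networks; seeds make runs repeatable.
configs = [make_configuration(n, seed=n) for n in (5, 10, 20)]
```
]]></artwork>
        </figure>

        <t>Each resulting configuration could then be passed to a data generator, such as a network
          simulator, to produce candidate data. Seeding the random generator per configuration keeps
          the generated data set reproducible.</t>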
      </section>

      <section title="Data Optimization Stage">
        <t>The data optimization stage aims to select high-quality data from the candidate data
          gathered from various sources.</t>

        <t><list style="symbols">
            <t>Candidate data: Candidate data includes simulated network data generated in the data
              generation stage and the practical data from production networks.</t>

            <t>Data selection: The data selection module examines the candidate data and classifies
              it into easy, hard, and Out-of-Distribution (OOD) samples. Hard samples are samples
              that are difficult for the model to predict accurately. Exposing the model to more
              hard samples during training enables it to perform better on such samples later on.
              Easy samples and hard samples are considered valid and are added to the training
              data. OOD samples are considered invalid and are removed.</t>

            <t>High-quality data: High-quality data needs to be accurate, diverse, and consistent
              with practical data from production networks. These requirements can be verified with
              expert knowledge (such as the expected ranges of delay, queue utilization, link
              utilization, and average port occupancy).</t>
          </list></t>
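
        <t>The data selection step above can be sketched in Python as follows. This memo does not
          prescribe a selection algorithm, so the rules here are assumptions made for illustration:
          a sample is treated as OOD when any feature lies more than z_max standard deviations from
          the mean of the practical data, and a non-OOD sample is treated as hard when the relative
          prediction error of the current model exceeds err_max. The function names, thresholds,
          toy predictor, and sample values are all hypothetical.</t>

        <figure>
          <artwork><![CDATA[
```python
from statistics import mean, pstdev


def fit_reference(practical_features):
    """Per-feature mean and std of practical (production) data."""
    columns = zip(*practical_features)
    return [(mean(c), pstdev(c) or 1.0) for c in columns]


def is_ood(features, reference, z_max=3.0):
    """Assumed rule: any feature beyond z_max stds is OOD."""
    return any(abs(x - mu) / sigma > z_max
               for x, (mu, sigma) in zip(features, reference))


def select(candidates, reference, predict, err_max=0.1):
    """Split candidates into easy, hard, and OOD samples."""
    easy, hard, ood = [], [], []
    for features, label in candidates:
        if is_ood(features, reference):
            ood.append((features, label))
            continue
        error = abs(predict(features) - label)
        rel_err = error / max(abs(label), 1e-9)
        (hard if rel_err > err_max else easy).append((features, label))
    return easy, hard, ood


def predict(features):
    """Toy stand-in for a trained DTN performance model."""
    return 1.0 + 10.0 * features[0]


# Features: (link utilization, packet size in bytes); label: delay.
practical = [(0.2, 500), (0.4, 600), (0.6, 700), (0.8, 800)]
candidates = [
    ((0.5, 650), 6.0),    # predicted well by the toy model
    ((0.7, 700), 12.0),   # large prediction error
    ((0.5, 9000), 6.0),   # packet size far from practical data
]

reference = fit_reference(practical)
easy, hard, ood = select(candidates, reference, predict)
training_data = easy + hard  # OOD samples are dropped
```
]]></artwork>
        </figure>

        <t>With these toy values, the three candidates fall into the easy, hard, and OOD groups
          respectively, and only the first two are kept for training.</t>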
      </section>
    </section>

    <section anchor="Discussion" title="Discussion">
      <t>Several topics related to data generation and optimization for DTN performance modeling
        require further discussion.</t>

      <t><list style="symbols">
          <t>Data generation methods: 1) Generate configurations that cover enough scenarios and
            scale from small to large networks. 2) Choose data generators by weighing accuracy,
            speed, fidelity, etc. 3) Use data augmentation to expand the training data, using prior
            knowledge to generate similar data from a small amount of practical data.</t>

          <t>Data optimization methods: 1) Select data from multi-source candidate data, including
            hard sample mining, OOD detection, etc. 2) Verify whether the data quality meets the
            requirements.</t>

          <t>Deployment: 1) Time/space complexity and explainability of the data generation and
            optimization methods. 2) Provide feedback for data collection to form a closed loop.</t>
        </list></t>
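
      <t>One possible starting point for the quality verification topic listed above is a simple
        range check driven by expert knowledge. The Python sketch below is illustrative only: the
        metric names, the bounds, and the sample records are assumptions, and real bounds would
        depend on the network being modeled.</t>

      <figure>
        <artwork><![CDATA[
```python
# Assumed expert-knowledge bounds; real values are network-specific.
EXPERT_RANGES = {
    "delay_ms": (0.0, 500.0),
    "queue_util": (0.0, 1.0),
    "link_util": (0.0, 1.0),
}


def violations(record, ranges=EXPERT_RANGES):
    """Names of metrics that are missing or outside their range."""
    bad = []
    for name, (low, high) in ranges.items():
        value = record.get(name)
        if value is None or not low <= value <= high:
            bad.append(name)
    return bad


def quality_report(records, ranges=EXPERT_RANGES):
    """Fraction of records that pass every range check."""
    passed = sum(1 for r in records if not violations(r, ranges))
    return passed / len(records) if records else 0.0


records = [
    {"delay_ms": 12.5, "queue_util": 0.30, "link_util": 0.55},
    {"delay_ms": -3.0, "queue_util": 0.10, "link_util": 0.20},
    {"delay_ms": 40.0, "queue_util": 0.95, "link_util": 1.40},
]
pass_rate = quality_report(records)
```
]]></artwork>
      </figure>

      <t>A low pass rate for one data source could be fed back to data generation or data
        collection, which is one way to approach the closed loop mentioned under deployment.</t>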
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document has no requests to IANA.</t>
    </section>
  </middle>

  <back>
    <references title="Informative References">
      <?rfc include="reference.I-D.irtf-nmrg-network-digital-twin-arch"?>

      <reference anchor="RouteNet">
        <front>
          <title>RouteNet: Leveraging Graph Neural Networks for network modeling and optimization in
            SDN</title>

          <author fullname="K. Rusek" initials="K." surname="Rusek">
            <organization/>
          </author>

          <author fullname="J. Su&aacute;rez-Varela" initials="J."
            surname="Su&aacute;rez-Varela">
            <organization/>
          </author>

          <author fullname="P. Almasan" initials="P." surname="Almasan">
            <organization/>
          </author>

          <author fullname="P. Barlet-Ros" initials="P." surname="Barlet-Ros">
            <organization/>
          </author>

          <author fullname="A. Cabellos-Aparicio" initials="A."
            surname="Cabellos-Aparicio">
            <organization/>
          </author>

          <date month="October" year="2020"/>
        </front>

        <seriesInfo name="IEEE Journal on Selected Areas in Communications (JSAC)"
          value="vol. 38, no. 10"/>
      </reference>

      <reference anchor="MimicNet">
        <front>
          <title>MimicNet: Fast Performance Estimates for Data Center Networks with Machine
            Learning</title>

          <author fullname="Q. Zhang" initials="Q." surname="Zhang">
            <organization/>
          </author>

          <author fullname="K. K.W. Ng" initials="K. K.W." surname="Ng">
            <organization/>
          </author>

          <author fullname="C. W. Kazer" initials="C. W." surname="Kazer">
            <organization/>
          </author>

          <author fullname="S. Yan" initials="S." surname="Yan">
            <organization/>
          </author>

          <author fullname="J. Sedoc" initials="J." surname="Sedoc">
            <organization/>
          </author>

          <author fullname="V. Liu" initials="V." surname="Liu">
            <organization/>
          </author>

          <date month="August" year="2021"/>
        </front>

        <seriesInfo name="ACM SIGCOMM 2021 Conference" value="(SIGCOMM &rsquo;21)"/>
      </reference>
    </references>

    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>
  </back>
</rfc>
