<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc>
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="2"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-ietf-nmop-network-anomaly-architecture-04"
     ipr="trust200902" consensus="true" submissionType="IETF">
  <front>
    <title abbrev="Network Anomaly Detection Framework">A Framework for
		a Network Anomaly Detection Architecture</title>

    <author fullname="Thomas Graf" initials="T" surname="Graf">
      <organization>Swisscom</organization>

      <address>
        <postal>
          <street>Binzring 17</street>

          <city>Zurich</city>

          <code>8045</code>

          <country>Switzerland</country>
        </postal>

        <email>thomas.graf@swisscom.com</email>
      </address>
    </author>

    <author fullname="Wanting Du" initials="W" surname="Du">
      <organization>Swisscom</organization>

      <address>
        <postal>
          <street>Binzring 17</street>

          <city>Zurich</city>

          <code>8045</code>

          <country>Switzerland</country>
        </postal>

        <email>wanting.du@swisscom.com</email>
      </address>
    </author>

    <author fullname="Pierre Francois" initials="P." surname="Francois">
      <organization>INSA-Lyon</organization>

      <address>
        <postal>
          <street/>

          <city>Lyon</city>

          <region/>

          <code/>

          <country>France</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>pierre.francois@insa-lyon.fr</email>

        <uri/>
      </address>
    </author>

    <author fullname="Alex Huang Feng" initials="A." surname="Huang Feng">
      <organization>INSA-Lyon</organization>

      <address>
        <postal>
          <street/>

          <city>Lyon</city>

          <region/>

          <code/>

          <country>France</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>alex.huang-feng@insa-lyon.fr</email>

        <uri/>
      </address>
    </author>

    <date day="04" month="July" year="2025"/>

    <area>Operations and Management</area>

    <workgroup>NMOP</workgroup>

    <abstract>
      <t>This document describes the motivation and architecture of a
			Network Anomaly Detection Framework and the relationship to other
			documents describing network Symptom semantics and network
			incident lifecycle.</t>

      <t>The described architecture for detecting IP network service
      interruptions is designed to be generically applicable and
			extensible. Different applications are described and examples are
			referenced with open-source running code.</t>
    </abstract>

    <note removeInRFC="true">
      <name>Discussion Venues</name>

      <t>Discussion of this document takes place on the NMOP Working
      Group mailing list (nmop@ietf.org), which is archived at <eref
      target="https://mailarchive.ietf.org/arch/browse/nmop/"/>.</t>

      <t>Source for this draft and an issue tracker can be found at <eref
      target="https://github.com/ietf-wg-nmop/draft-ietf-nmop-network-anomaly-architecture/"/>
      .</t>
    </note>
  </front>

  <middle>
    <section anchor="Introduction" title="Introduction">
      <t>Today's highly virtualized large-scale IP networks are a
			challenge for network operators to monitor due to their vast
			number of dependencies. Humans are no longer capable of manually
			verifying all these dependencies end to end in a timely
			manner.</t>

      <t>IP networks are the backbone of today's society. We
			individually depend on networks fulfilling the purpose of
			forwarding IP packets from a point A to a point B at any time of
			the day. Even a short loss of such connectivity today has
			manifold implications that range from minor to severe. An
			interruption can lead to being unable to browse the web, watch a
			soccer game, access the company intranet or, in life-threatening
			situations, no longer being able to reach emergency services.
			Further, congestion in the network leading to delayed packet
			forwarding can have severe repercussions on real-time
			applications.</t>

      <t>Networks are generally deterministic; their usage, however, is
			only somewhat so. Humans, as a large group of people, are
			somewhat predictable. There are time-of-day patterns in terms of
			when we eat, sleep, work or enjoy leisure time, and these
			patterns potentially change depending on age, profession and
			cultural background.</t>

      <section anchor="Motivation" title="Motivation">
        <t>When operational or configurational changes to connectivity
        services occur, it is crucial for network operators to detect
				interruptions within the network faster than the users of
				those connectivity services do.</t>

        <t>In order to achieve this objective, automation in network
        monitoring is required. The people operating the network are
        today simply outnumbered by the people utilizing connectivity
				services.</t>

        <t>This automation needs to monitor network changes holistically
				by supervising all 3 network planes simultaneously for a given
        connectivity service. The monitoring system needs to detect
				whether configurational or operational State changes (for
				example, an interface shut down by an operator versus an
				interface whose State went down due to loss of signal on the
				optical layer) are service disruptive or not, e.g., whether
				the packets received from customers are no longer forwarded
				to the desired destination. A State change in the control
				plane and management plane together indicates a network
				topology State change, while a State change in the forwarding
				plane describes how the packets are being forwarded. In other
				words, control and management plane State changes can be
				attributed to network topology State changes, whereas
				forwarding plane State changes are related to the outcome of
				these network topology State changes.</t>

        <t>Since changes in networks are happening all the time due to
				the vast number of dependencies, a scoring system is needed to
				indicate whether a change is considered disruptive. The
				scoring system needs to take into account the number of
				transport sessions, the number of affected flows and whether
				the detected interruptions are usual or exceptional.</t>
      </section>

      <section anchor="Scope" title="Scope">
        <t>Such objectives can be achieved by applying checks on network
        modeled time series data that contains semantics describing
				their dependencies across network planes. These checks can be
				based on domain knowledge or on outlier detection techniques.
        Domain-knowledge-based techniques apply the expertise of
				network engineers operating a network to understand whether
				there is an issue impacting the customer or not. On the other
				hand, outlier detection techniques identify measurements that
				deviate significantly from the norm and are therefore
				considered anomalous.</t>

        <t>The described scope does not take the connectivity service
				intent into account, nor does it verify whether the intent is
				being achieved at all times. Changes to the service intent
				that cause service disruptions are therefore considered
				service disruptions, whereas a monitoring system that takes
				the intent into account would consider them intended.</t>
      </section>
    </section>

    <section anchor="Conventions_and_Definitions"
             title="Conventions and Definitions">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
			"SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",
			"NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to
			be interpreted as described in BCP 14 <xref target="RFC2119"/>
			<xref target="RFC8174"/> when, and only when, they appear in all
			capitals, as shown here.</t>

      <section anchor="Terminology" title="Terminology">
        <t>This document defines the following terms:</t>

        <t>Outlier Detection: A systematic approach to identify rare
				data points deviating significantly from the majority.</t>

        <t>Service Disruption Detection (SDD): The process of detecting
				a service degradation by discovering anomalies in network
				monitoring data.</t>

        <t>Service Disruption Detection System (SDDS): A system that
				performs SDD.</t>

        <t>Additionally, it makes use of the terms defined in <xref
        target="I-D.ietf-nmop-terminology"/>, <xref
        target="I-D.ietf-nmop-network-anomaly-lifecycle"/> and
				<xref target="RFC8969"/>.</t>

        <t>The following terms are used as defined in <xref
        target="I-D.ietf-nmop-terminology"/> :</t>

        <t><list style="symbols">
            <t>Resource</t>

            <t>Event</t>

            <t>State</t>

            <t>Relevance</t>

            <t>Problem</t>

            <t>Symptom</t>

            <t>Alarm</t>
          </list></t>

        <t>Figure 2 in <xref section="3" sectionFormat="of"
        target="I-D.ietf-nmop-terminology"/> shows characteristics of
				observed operational network telemetry metrics.</t>

        <t>Figure 4 in <xref section="3" sectionFormat="of"
        target="I-D.ietf-nmop-terminology"/> shows relationships
				between state, relevant state, problem, symptom, cause and
				alarm.</t>

        <t>Figure 5 in <xref section="3" sectionFormat="of"
        target="I-D.ietf-nmop-terminology"/> shows relationships between
        problem, symptom and cause.</t>

        <t>The following terms are used as defined in <xref
        target="I-D.ietf-nmop-network-anomaly-lifecycle"/> :</t>

        <t><list style="symbols">
            <t>False Positive</t>

            <t>False Negative</t>
          </list></t>

        <t>The following terms are used as defined in <xref
				target="RFC8969"/> :</t>

        <t><list style="symbols">
            <t>Service Model</t>
          </list></t>

      </section>

      <section anchor="Outlier_Detection" title="Outlier Detection">
        <t>Outlier Detection, also known as anomaly detection, describes
				a systematic approach to identify rare data points deviating
        significantly from the majority. Outliers can manifest as a
				single data point or as a sequence of data points. There are
				in general multiple ways to classify anomalies, but in the
				context of this draft, the following three classes are taken
				into account:</t>

        <dl>
          <dt>Global outliers:</dt>

          <dd>An outlier is considered "global" if its behavior is
					outside the entirety of the considered data set. For example,
					if the average dropped packet count is between 0 and 10 per
					minute and, in a small time-window, the value gets to 1000,
					this data point is considered a global anomaly.</dd>
        </dl>

        <dl>
          <dt>Contextual outliers:</dt>

          <dd>An outlier is considered "contextual" if its behavior is
					within a normal (expected) range, but it would not be expected
					based on some context. Context can be defined as a function of
					multiple parameters, such as time, location, etc. An example
					of a contextual outlier is when the forwarded packet volume
					overnight reaches levels which might be totally normal for the
					daytime, but anomalous and unexpected for the nighttime.</dd>
        </dl>

        <dl>
          <dt>Collective outliers:</dt>

          <dd>An outlier is considered "collective" if the behavior of
					each single data point that is part of the anomaly is within
					expected ranges (so they are not anomalous in either a
					contextual or a global sense), but the group, taking all the
					data points together, is. Note that the group can be made
					within a single time series (a sequence of data points is
					anomalous) or across multiple metrics (e.g. if looking at two
					metrics together, the combined behavior turns out to be
					anomalous). In Network Telemetry time series, one way this can
          manifest is that the number of network path and interface
					State changes matches the time range in which the forwarded
					packet volume decreases as a group.</dd>
        </dl>

        <t>For each outlier, a score between 0 and 1 is calculated.
				The higher the value, the higher the probability that the
				observed data point is an outlier. <xref target="VAP09">Anomaly
				detection: A survey</xref> gives additional details on anomaly
				detection and its types.</t>
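        <t>As a non-normative illustration of such scoring, the
				following Python sketch maps the deviation of a data point
				from its history into a score between 0 and 1. The function
				name and the three-standard-deviation scale are illustrative
				assumptions, not part of this framework.</t>

        <figure><artwork><![CDATA[
```python
from statistics import fmean, pstdev

def global_outlier_score(history, value, scale=3.0):
    """Score in [0, 1]: how far `value` deviates from the mean of
    `history`, normalized by `scale` standard deviations (an
    assumed, illustrative threshold)."""
    mean, std = fmean(history), pstdev(history)
    if std == 0:
        return 0.0 if value == mean else 1.0
    return min(1.0, abs(value - mean) / (scale * std))

# Dropped-packet counts per minute, normally between 0 and 10:
history = [2, 4, 1, 7, 3, 5, 6, 2, 4, 3]
print(global_outlier_score(history, 4))     # close to the norm, low score
print(global_outlier_score(history, 1000))  # global outlier, score 1.0
```
        ]]></artwork></figure>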
      </section>

      <section anchor="Knowledge_Based_Detection"
               title="Knowledge Based Detection">
        <t>Knowledge-based anomaly detection, a superset of rule-based
				anomaly detection, is a technique used to identify anomalies or
				outliers by comparing them against predefined rules or patterns.
				This approach relies on the use of domain-specific knowledge to
				set standards, thresholds, or rules for what is considered
				"normal" behavior. Traditionally, these rules are established
				manually by a knowledgeable network engineer. Looking forward,
				these rules can be expressed using human- and machine-readable
				Symptoms derived from network protocols and patterns defined in
				ontologies.</t>
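        <t>As a non-normative sketch of how such manually established
				rules could be encoded, the following Python fragment
				evaluates a list of domain-knowledge predicates against one
				aggregated observation. The metric names and thresholds are
				illustrative assumptions.</t>

        <figure><artwork><![CDATA[
```python
# Hypothetical knowledge-based rules: names and thresholds are
# illustrative, not defined by this document.
RULES = [
    # (description, predicate over one aggregated observation)
    ("interface down with traffic loss",
     lambda obs: obs["if_oper_state"] == "down" and obs["pkt_delta"] == 0),
    ("forwarding drops above threshold",
     lambda obs: obs["dropped_pkts"] > 100),
]

def evaluate(observation):
    """Return the descriptions of all rules the observation violates."""
    return [desc for desc, pred in RULES if pred(observation)]

obs = {"if_oper_state": "down", "pkt_delta": 0, "dropped_pkts": 3}
print(evaluate(obs))  # ['interface down with traffic loss']
```
        ]]></artwork></figure>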

        <t>Additionally, in the context of network anomaly detection,
				the knowledge-based approach works hand in hand with the
				deterministic understanding of the network, which is reflected
				in network modeling. Components are organized into three network
				planes: the Management Plane, the Control Plane, and the
				Forwarding Plane <xref target="RFC9232"/>.  A component can
				relate to a physical, virtual, or configurational entity, or to
				a sum of packets belonging to a flow being forwarded in a
				network.</t>

        <t>Such relationships can be modeled in a SIMAP to
        automate that process. <xref
				target="I-D.ietf-nmop-simap-concept"/> defines the
				concepts for the SIMAP and <xref
				target="I-D.havel-nmop-digital-map"/> defines an application of
				the SIMAP to network topologies.</t>
		
        <t>These relationships can also be modeled in a Knowledge Graph
				<xref section="5" sectionFormat="of"
				target="I-D.mackey-nmop-kg-for-netops"/> where ontologies can be
				used to augment the relationships among different network
				elements in the network model.</t>
      </section>

      <section anchor="Data_Mesh" title="Data Mesh">
        <t>The <xref target="Deh22">Data Mesh</xref> Architecture
        distinguishes between operational and analytical data.
				Operational data refers to collected data from operational
				systems. While analytical data refers to insights gained from
				operational data.</t>

        <section anchor="Operational_Network_Data"
                 title="Operational Network Data">
          <t>In terms of network observability, semantics of operational
          network metrics are defined by IETF and are categorized as
					described in the Network Telemetry Framework <xref
					target="RFC9232"/> in the following three different network
					planes:</t>

          <dl>
            <dt>Management Plane:</dt>

            <dd>Time series data describing the State changes and
						statistics of a network node and its Resources. For example,
						Interface State and statistics modeled in
						ietf-interfaces.yang <xref target="RFC8343"/>.</dd>
          </dl>

          <dl>
            <dt>Control Plane:</dt>

            <dd>Time series data describing the State and State changes
						of network reachability. For example, BGP VPNv6 unicast
						updates and withdrawals exported in BGP Monitoring Protocol
						(BMP) <xref target="RFC7854"/> and modeled in BGP <xref
            target="RFC4364"/>.</dd>
          </dl>

          <dl>
            <dt>Forwarding Plane:</dt>

            <dd>Time series data describing the forwarding behavior of
						packets and its data-plane context. For example, dropped
						packet count modeled in the IPFIX entities
						forwardingStatus(IE89) <xref target="RFC7270"/> and
						packetDeltaCount(IE2) <xref target="RFC5102"/> and exported
						with IPFIX <xref target="RFC7011"/>.</dd>
          </dl>
        </section>

        <section anchor="Analyticsl_Observed_Symptoms"
                 title="Analytical Observed Symptoms">
          <t>The Service Disruption Detection process takes operational
					network data as input and generates analytical metrics
					describing Symptoms and outlier patterns of the connectivity
          service disruption.</t>

          <t>The observed Symptoms are categorized into a semantic triple
          <xref target="W3C-RDF-concept-triples"/>: action,
          reason, trigger. The object is the action, describing
          the change in the network. The reason is the predicate,
          defining why this change occurred, and the subject is the
          trigger, which defines what triggered that change.</t>
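          <t>As a non-normative illustration, such a triple can be
					carried as a simple data structure. The field values below
					are invented examples; the actual Symptom semantics are
					defined in the referenced documents.</t>

          <figure><artwork><![CDATA[
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SymptomTriple:
    """One observed Symptom as a semantic triple (illustrative)."""
    trigger: str  # subject: what triggered the change
    reason: str   # predicate: why the change occurred
    action: str   # object: the change observed in the network

# Invented example values, not normative Symptom semantics:
symptom = SymptomTriple(
    trigger="interface",
    reason="link-failure",
    action="bgp-withdraw",
)
print(f"{symptom.trigger} {symptom.reason} {symptom.action}")
```
          ]]></artwork></figure>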

          <t>Symptom definitions are described in <xref section="3"
          sectionFormat="of"
          target="I-D.ietf-nmop-network-anomaly-semantics"/> and
          outlier pattern semantics in <xref section="4"
					sectionFormat="of"
          target="I-D.ietf-nmop-network-anomaly-lifecycle"/>. Both are
					expressed in YANG Service Models.</t>
		  
          <t>However, the semantic triples could also be expressed with
					the Semantic Web Technology Stack in RDF, RDFS and OWL
					definitions as described in <xref section="6"
					sectionFormat="of" target="I-D.mackey-nmop-kg-for-netops"/>.
					Together with the ontology definitions described in <xref
					target="Knowledge_Based_Detection"/>, a Knowledge Graph can be
					created describing the relationship between the network state
					and the observed Symptom.</t>
        </section>
      </section>
    </section>

    <section anchor="Elements_of_the_Architecture"
             title="Elements of the Architecture">
      <t>A system architecture aimed at detecting service disruptions is
      typically built upon multiple components, for which design choices
			need to be made. In this section, we describe the main components
			of the architecture and delve into considerations to be made when
			designing such components in an implementation.</t>

      <t>The system architecture is illustrated in <xref 
			target="fig_arch"/> and its main components are described in the
			following subsections.</t>

      <figure align="center" anchor="fig_arch"
              title="Service Disruption Detection System Architecture">
        <artwork align="center"><![CDATA[

         +---------+                     +-------------------+
         |Service  |                     |     Alarm and     |
    |--- |Inventory|                     | Problem Management|
    |    |         |                     |      System       |
    |    +---------+                     +-------------------+
    |      |                                      ^     Stream
    |      |                                      |
    |      |       +---------+           +-------------------+
    |      |       | Post-   | Stream    |   Message Broker  |
    |      |       | mortem  | <-------- |  with Analytical  |
    |      |       | System  |           |    Network Data   |
    |      |       +---------+           +-------------------+
    |      |            |                         ^     Stream
    |      |            |                         |
    |      |            |                +-------------------+
    |      | Profile    | Fine           | Alarm Aggregation | Store Label
    |      | and        | Tune           | for Anomaly       | ------------|
    |      | Generate   | SDD            | Detection         |             |
    |      | SDD Config | Config         +-------------------+             |
    |      |            |                       ^  ^  ^ Stream             |
    |      v            v                       |  |  |                    v
    |   +-------------------+            +-------------------+        +---------+
    |   | Service Disruption| Schedule   | Service Disruption| Replay |  Data   |
    |   |     Detection     | ---------> |     Detection     |<------ | Storage |
    |   |   Configuration   | Detection  |                   |        |         |
    |   +-------------------+            +-------------------+        +---------+
    |                                       ^ ^ Stream ^ ^ ^               ^
    |                                       | |        | | |               |
    |                                    +---------+---------+             |
    |                                    | Network |  Data   | Store       |
    |----------------------------------> |  Model  |  Aggr.  | ------------|
                                         |         | Process | Operational Data
                                         +---------+---------+
                                                ^  ^  ^ Stream
                                                |  |  |
                                         +-------------------+
                                         |   Message Broker  |
                                         |  with Operational |
                                         |    Network Data   |
                                         +-------------------+
                                                ^  ^  ^ Stream
 Subscribe                       Publish        |  |  |
           +-------------------+         +-------------------+
           | Network Node with | ------> | Network Telemetry |
 --------> | Network Telemetry | ------> |  Data Collection  |
           |   Subscription    | ------> |                   |
           +-------------------+         +-------------------+

      ]]></artwork>
      </figure>

      <section anchor="Arch_Inventory" title="Service Inventory">
        <t>A service inventory is used to obtain a list of the
				connectivity services for which Anomaly Detection is to be
				performed. A service profiling process may be executed on the
				service in order to define a configuration of the service
				disruption detection approach and parameters to be used.</t>
      </section>

      <section anchor="Arch_Configuration" title="SDD Configuration">
        <t>Based on this service list and potential preliminary service
        profiling, a configuration of the Service Disruption Detection
				is produced. It defines the set of approaches that need to be
				applied to perform SDD, as well as parameters, grouped in
				templates, that are to be set when executing the algorithms
				performing SDD per se.</t>

        <t>As the service lives on, the configuration may be adapted as
				a result of an evolution of the profiling being performed.
        Postmortem analyses are produced as a result of Events impacting
				the service, or of false positives raised by the Alarm system.
				These postmortem analyses can lead to improvements of the
				deployed profile parameters and to the creation of new customer
				profiles.</t>
      </section>

      <section anchor="Arch_Collection" title="Operational Data
			Collection">
        <t>Collection of network monitoring data involves the management
				of the subscriptions to network telemetry on nodes of the
				network, and the configuration of the collection infrastructure
				to receive the monitoring data produced by the network.</t>

        <t>The monitoring data produced by the collection infrastructure
				is then streamed through a message broker system, for further
        processing.</t>

        <t>Networks tend to produce extremely large amounts of
				monitoring data. To preserve scaling and reduce costs, decisions
				need to be made on the duration of retention of such data in
				storage, and at which level of storage they need to be kept. A
				retention time needs to be set on the raw data produced by the
				collection system, in accordance with its utility for further
				use. This aspect will be elaborated in further sections.</t>
      </section>

      <section anchor="Arch_Aggregation" title="Operational Data
			Aggregation">
        <t>Aggregation is the process of producing data upon which
				detection of a service disruption can be performed, based on
				collected network monitoring data.</t>

        <t>Pre-processing of collected network monitoring data is 
				usually performed so as to produce input for the Service
				Disruption Detection component. This can be achieved in multiple
				ways, depending on the architecture of the SDD component. As an
				example, the granularity or cardinality at which forwarding plane
				data is produced by the network may be too high for the SDD
				algorithms, and instead be aggregated into a coarser dimension
				for SDD execution.</t>
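        <t>As a non-normative sketch of such pre-processing, the
				following Python fragment aggregates per-flow packet counts
				into a coarser per-peer, per-minute dimension. The record
				fields and aggregation keys are illustrative assumptions.</t>

        <figure><artwork><![CDATA[
```python
from collections import defaultdict

def aggregate_flows(flow_records, keys=("peer", "minute")):
    """Sum per-flow packet counts into a coarser dimension
    (here: per peer and per minute; keys are illustrative)."""
    totals = defaultdict(int)
    for rec in flow_records:
        totals[tuple(rec[k] for k in keys)] += rec["packets"]
    return dict(totals)

records = [
    {"peer": "AS65001", "minute": 0, "packets": 120},
    {"peer": "AS65001", "minute": 0, "packets": 80},
    {"peer": "AS65002", "minute": 0, "packets": 10},
]
print(aggregate_flows(records))
# {('AS65001', 0): 200, ('AS65002', 0): 10}
```
        ]]></artwork></figure>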

        <t>A retention time also needs to be decided upon for aggregated
				data. Note that the retention time must be set carefully, in
				accordance with the replay capability requirement discussed in
				<xref target="Arch_Replaying"/>.</t>
      </section>

      <section anchor="Arch_SDD" title="Service Disruption Detection">
        <t>Service Disruption Detection processes the aggregated network
				data in order to decide whether a service is degraded to the
				point where network operation needs to be alerted of an ongoing
				Problem within the network.</t>

        <t>Two key aspects need to be considered when designing the SDD
        component. First, the way the data is being processed needs to
				be carefully designed, as networks typically produce extremely
				large amounts of data which may hinder the scalability of the
				architecture. Second, the algorithms used to make a decision to
				alert the operator need to be designed in such a way that the
				operator can trust that a targeted Service Disruption will be
				detected (no false negatives), while not spamming the operator
				with Alarms that do not reflect an actual issue within the
				network (false positives) leading to Alarm fatigue.</t>

        <t>Two approaches are typically followed to present the data to
				the SDD system. Classically, the aggregated data can be stored
				in a database that is polled at regular intervals by the SDD
				component for decision making. Alternatively, a streaming
				approach can be followed so as to process the data while they
				are being consumed from the collection component.</t>

        <t>For SDD per se, two families of algorithms can be decided
				upon. First, knowledge-based detection approaches can be used,
				mimicking the process that human operators follow when looking
				at the data. Second, Machine-Learning-based outlier detection
				approaches can be used to detect deviations from the norm.</t>

        <section anchor="Modeling" title="Network Modeling">
          <t>Some input to SDD is made of established knowledge of the
					network that is unrelated to the dimensions according to which
					outlier detection is performed. For example, the knowledge of
					the network infrastructure may be required to perform some
					service disruption detection. Such data need to be rendered
					accessible and updatable for use by SDD. They may come from
					inventories, or automated gathering of data from the network
					itself.</t>
        </section>

        <section anchor="Profiling" title="Data Profiling">
          <t>As rules cannot be crafted specifically for each customer,
					they need to be defined according to pre-established service
					profiles. Processing of monitoring data can be performed in
					order to associate each service with a profile. External
					knowledge on the customer can also help in associating a
					service with a profile.</t>
        </section>

        <section anchor="Strategies" title="Detection Strategies">
          <t>For a profile, a set of strategies is defined. Each
					strategy captures one approach to look at the data (as a human
					operator does) to observe if an abnormal situation is arising.
					Strategies are defined as a function of observed outliers as
					defined in <xref target="Outlier_Detection"/>.</t>

          <t>When one of the strategies applied for a profile detects a
          concerning global outlier or collective outlier, an Alarm
					needs to be raised.</t>

          <t>Depending on the implementation of the architecture, a
					scheduler may be needed in order to orchestrate the evaluation
					of the Alarm levels for each strategy applied for a profile,
					for all service instances associated with such profile.</t>
        </section>

        <section anchor="RB_vs_ML" title="Machine Learning">
          <t>Machine-learning-based anomaly detection can also be
					seamlessly integrated into such an SDDS. Machine learning is
					commonly used for detecting outliers or anomalies. Typically,
					unsupervised learning is widely recognized for its
					applicability, given the inherent characteristics of network
					data. Although machine learning requires a sizeable amount of
					high-quality data and considerable training effort, the
					advantages it offers make these requirements worthwhile. The
					power of this approach lies in its generalizability,
          robustness, ability to simplify the fine-tuning process and,
					most importantly, its capability to identify anomaly patterns
					that might go unnoticed by a human observer.</t>
        </section>

        <section anchor="Storage" title="Storage">
          <t>Storage may be required to execute SDD, as some algorithms
					may be relying on historical (aggregated) monitoring data in
					order to detect anomalies. Careful considerations need to be
					made on the level at which such data is stored, as slow access
					to such data may be detrimental to the reactivity of the
					system.</t>
        </section>
      </section>

      <section anchor="Arch_Alarm" title="Alarm">
        <t>When the SDD component decides that a service is undergoing a
        disruption, an aggregated relevant-state change notification,
				taking the output of multiple Service Disruption Detection
				processes into account, needs to be sent to the Alarm and
				Problem management system as shown in Figure 4 in <xref
				section="3" sectionFormat="of"
				target="I-D.ietf-nmop-terminology"/>. Multiple practical aspects
				need to be taken into account in this component.</t>

        <t>When the issue lasts longer than the interval at which the
				SDD component runs, the relevant-state change mechanism should
				not create multiple notifications to the operator, so as not to
				overwhelm the management of the issue. However, the information
				provided along with the Alarm should be kept up to date during
				the full duration of the issue.</t>
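        <t>A non-normative sketch of this behavior: notify once per
				ongoing issue, while keeping the associated details current.
				The class and its interface are illustrative assumptions.</t>

        <figure><artwork><![CDATA[
```python
class AlarmManager:
    """Sketch: raise one notification per ongoing issue and refresh
    its details instead of re-notifying (interface is illustrative)."""

    def __init__(self):
        self.active = {}       # issue id -> latest details
        self.notifications = []

    def report(self, issue_id, details):
        if issue_id not in self.active:
            self.notifications.append(issue_id)  # notify only once
        self.active[issue_id] = details          # keep details fresh

mgr = AlarmManager()
mgr.report("svc-42", {"score": 0.7})
mgr.report("svc-42", {"score": 0.9})  # same issue: update, no new alarm
print(mgr.notifications)       # ['svc-42']
print(mgr.active["svc-42"])    # {'score': 0.9}
```
        ]]></artwork></figure>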
      </section>

      <section anchor="Arch_Postmortem" title="Postmortem">
        <figure anchor="simplified_lifecycle"
                title="Anomaly Detection Refinement Lifecycle">
          <artwork align="center"><![CDATA[


   Network Anomaly
     Detection             Symptoms
+-------------------+         &
|   +-----------+   | Network Anomalies
|   | Detection |---|---------+
|   |   Stage   |   |         |
|   +-----------+   |         v
+---------^---------+    +-------------------+   Labels  +------------+
          |              | Anomaly Detection |---------->| Validation |
          |              |   Label Store     |<----------|   Stage    |
          |              +-------------------+  Revised  +------------+
   +------------+             |                 Labels
   | Refinement |             |
   |   Stage    |<------------+
   +------------+    Historical Symptoms
                              &
                      Network Anomalies


          ]]></artwork>
        </figure>

        <t>Validation and refinement are performed during Postmortem
				analysis.</t>

        <t>From an Anomaly Detection Lifecycle point of view, as
				described in <xref
				target="I-D.ietf-nmop-network-anomaly-lifecycle"/>, the
				Service Disruption Detection Configuration evolves over time,
        iteratively, looping over three main phases: detection,
				validation and refinement.</t>

        <t>The Detection phase produces the Alarms that are sent to the
				Alarm and Problem Management System and at the same time it
				stores the network anomaly and Symptom labels into the Label
				Store. This enables network engineers to review the labels to
				validate and edit them as needed.</t>

        <t>The Validation stage is typically performed by network
				engineers reviewing the results of the detection and indicating
				which Symptoms and network anomalies have been useful for the
				identification of Problems in the network. The original labels
				from the Service Disruption Detection are analyzed and an
				updated set of more accurate labels is provided back to the
				label store.</t>

        <t>The resulting labels are then provided back to the Network
				Anomaly Detection via its refinement capabilities: refinement
				updates the Service Disruption Detection configuration in order
				to improve the results of the detection (e.g., reducing false
				positives and false negatives, or improving the accuracy of the
				detection boundaries).</t>
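        The label-store interaction underlying this lifecycle can be sketched as follows: detection labels and engineer-revised labels are kept side by side, so the Refinement stage can compare them, e.g. to identify false positives. The structure is hypothetical and only illustrative of the data flow in the figure above.

```python
class LabelStore:
    """Keep original detection labels and revised labels side by
    side, so refinement can compare the two sets."""

    def __init__(self):
        self.records = {}   # anomaly id mapped to a label pair

    def add_detection(self, anomaly_id, labels):
        """Detection stage: store the original Symptom labels."""
        self.records[anomaly_id] = {"detected": set(labels),
                                    "revised": None}

    def revise(self, anomaly_id, labels):
        """Validation stage: store the engineer-revised labels."""
        self.records[anomaly_id]["revised"] = set(labels)

    def false_positives(self):
        """Refinement input: anomalies dismissed during validation."""
        return [a for a, r in self.records.items()
                if r["revised"] is not None and not r["revised"]]

store = LabelStore()
store.add_detection("a1", ["packet-loss"])
store.add_detection("a2", ["jitter"])
store.revise("a1", ["packet-loss"])   # confirmed by the engineer
store.revise("a2", [])                # dismissed: a false positive
```

        The list of dismissed anomalies is exactly the kind of input the Refinement stage uses to adjust the Service Disruption Detection configuration.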
      </section>

      <section anchor="Arch_Replaying" title="Replaying">
        <t>When a service disruption has been detected, it is essential
				for the human operator to be able to analyze the data which led
				to the raising of an Alarm. It is thus important that an SDDS
				preserves both the data which led to the creation of the Alarm
				and human-understandable information on why that data led to
				the raising of the Alarm.</t>

        <t>In early stages of operations, or when experimenting with an
				SDDS, the parameters used for SDD commonly need to be
				fine-tuned. This process is facilitated by designing the SDDS
				architecture in a way that allows the SDD algorithms to be
				rerun on the same input.</t>

        <t>Data retention, as well as the level of detail at which data
				is retained, needs to be defined in order not to sacrifice the
				ability to replay SDD executions for the sake of improving
				their accuracy.</t>
      </section>
    </section>

    <section anchor="Implementation" title="Implementation Status">
      <t>Note to the RFC-Editor: Please remove this section before
      publishing.</t>

      <t>This section records the status of known implementations.</t>

      <section anchor="Cosmos_Bright_Lights" title="Cosmos Bright
			Lights">
        <t>This architecture has been developed as part of a proof of
				concept, started in September 2022 first in a dedicated network
				lab environment and deployed in December 2022 in the Swisscom
				production network to monitor a limited set of 16 L3 VPN
				connectivity services.</t>

        <t>At the Applied Networking Research Workshop at IETF 117, the
        architecture was first published in the following academic
				paper: <xref target="Ahf23"/>.</t>

        <t>Since December 2022, 20 connectivity service disruptions have
				been monitored, along with 52 false positives which occurred
				when the time series database was temporarily not real-time and
				traffic profiling was missing, so that a comparison to the
				previous week was not applicable. For each of the 20
				connectivity service disruptions, 6 parameters were monitored;
				the service disruption was recognized by 1 parameter in 3
				cases, by 2 parameters in 8 cases, by 3 parameters in 6 cases,
				and by 4 parameters in 2 cases.</t>

        <t>A real-time streaming based version has been deployed in
				Swisscom production as a proof of concept in June 2024,
				monitoring more than 13'000 L3 VPNs concurrently. Improved
				profiling capabilities are currently under development.</t>
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD</t>
    </section>

    <section anchor="Contributors" title="Contributors">
      <t>The authors would like to thank Alex Huang Feng, Ahmed
			Elhassany and Vincenzo Riccobene for their valuable contribution.
			</t>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>The authors would like to thank Qin Wu, Ignacio Dominguez
	  Martinez-Casanueva, Adrian Farrel, Reshad Rahman and Ruediger Geib
		for their review and valuable comments.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.8969.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.9232.xml'?>

      <?rfc include='https://bib.ietf.org/public/rfc/bibxml-ids/reference.I-D.ietf-nmop-terminology.xml'?>

      <?rfc include='https://bib.ietf.org/public/rfc/bibxml-ids/reference.I-D.ietf-nmop-network-anomaly-semantics.xml'?>

      <?rfc include='https://bib.ietf.org/public/rfc/bibxml-ids/reference.I-D.ietf-nmop-network-anomaly-lifecycle.xml'?>

      <?rfc include='https://bib.ietf.org/public/rfc/bibxml-ids/reference.I-D.ietf-nmop-simap-concept.xml'?>

      <?rfc include='https://bib.ietf.org/public/rfc/bibxml-ids/reference.I-D.havel-nmop-digital-map.xml'?>

      <?rfc include='https://bib.ietf.org/public/rfc/bibxml-ids/reference.I-D.mackey-nmop-kg-for-netops.xml'?>
    </references>

    <references title="Informative References">
      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.4364.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.5102.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.7011.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.7270.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.7854.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.8343.xml'?>

      <reference anchor="Ahf23" target="https://hal.science/hal-04307611">
        <front>
          <title>Daisy: Practical Anomaly Detection in large BGP/MPLS and
          BGP/SRv6 VPN Networks</title>

          <author fullname="Alex Huang Feng" initials="A."
                  surname="Huang Feng"/>

          <date month="July" year="2023"/>
        </front>

        <seriesInfo name="DOI" value="10.1145/3606464.3606470"/>

        <refcontent>IETF 117, Applied Networking Research
        Workshop</refcontent>
      </reference>

      <reference anchor="VAP09"
                 target="https://www.researchgate.net/publication/220565847_Anomaly_Detection_A_Survey">
        <front>
          <title>Anomaly detection: A survey</title>

          <author fullname="Varun Chandola" initials="V." surname="Chandola"/>

          <author fullname="Arindam Banerjee" initials="A." surname="Banerjee"/>

          <author fullname="Vipin Kumar" initials="V." surname="Kumar"/>

          <date month="July" year="2009"/>
        </front>

        <seriesInfo name="DOI" value="10.1145/1541880.1541882"/>

        <refcontent>ACM Computing Surveys 41</refcontent>
      </reference>

      <reference anchor="Deh22"
                 target="https://www.oreilly.com/library/view/data-mesh/9781492092384/">
        <front>
          <title>Data Mesh</title>

          <author fullname="Zhamak Dehghani" initials="Z." surname="Dehghani"/>

          <date month="March" year="2022"/>
        </front>

        <seriesInfo name="ISBN" value="9781492092391"/>

        <refcontent>O'Reilly Media</refcontent>
      </reference>

      <reference anchor="W3C-RDF-concept-triples"
                 target="https://www.w3.org/TR/rdf-concepts/#section-triples">
        <front>
          <title>W3C RDF concept semantic triples</title>

          <author fullname="Richard Cyganiak" initials="R." surname="Cyganiak"/>
          <author fullname="David Wood" initials="D." surname="Wood"/>
		  <author fullname="Markus Lanthaler" initials="M." surname="Lanthaler"/>

          <date month="February" year="2014"/>
        </front>
        <refcontent>W3 Consortium</refcontent>
      </reference>
    </references>
  </back>
</rfc>
