<?xml version='1.0' encoding='utf-8'?>

<!DOCTYPE rfc [
  <!ENTITY RFC3877 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3877.xml">
  <!ENTITY RFC8632 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8632.xml">

  <!ENTITY I-D.ietf-nmop-network-anomaly-architecture SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.ietf-nmop-network-anomaly-architecture">
  <!ENTITY I-D.ietf-nmop-network-incident-yang SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.ietf-nmop-network-incident-yang">

]>

<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902" docName="draft-ietf-nmop-terminology-06" category="info" obsoletes="" updates="" submissionType="IETF" xml:lang="en" tocInclude="true" tocDepth="3" symRefs="true" sortRefs="true" version="3">

  <front>

    <title abbrev="Network Fault Terminology">Some Key Terms for Network Fault and Problem Management</title>

    <seriesInfo name="Internet-Draft" value="draft-ietf-nmop-terminology-06"/>

    <author initials="N." surname="Davis" fullname="Nigel Davis" role="editor">
      <organization>Ciena</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <country>United Kingdom</country>
        </postal>
        <email>ndavis@ciena.com</email>
      </address>
    </author>

    <author initials="A." surname="Farrel" fullname="Adrian Farrel" role="editor">
      <organization>Old Dog Consulting</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <country>United Kingdom</country>
        </postal>
        <email>adrian@olddog.co.uk</email>
      </address>
    </author>

    <author fullname="Thomas Graf" initials="T" surname="Graf">
      <organization>Swisscom</organization>
      <address>
        <postal>
          <street>Binzring 17</street>
          <city>Zurich</city>
          <code>8045</code>
          <country>Switzerland</country>
        </postal>
        <email>thomas.graf@swisscom.com</email>
      </address>
    </author>

    <author fullname="Qin Wu" initials="Q." surname="Wu">
      <organization>Huawei</organization>
      <address>
        <postal>
          <street>101 Software Avenue, Yuhua District</street>
          <city>Nanjing</city>
          <region>Jiangsu</region>
          <code>210012</code>
          <country>China</country>
        </postal>
        <email>bill.wu@huawei.com</email>
      </address>
    </author>

    <author initials="C." surname="Yu" fullname="Chaode Yu">
      <organization>Huawei Technologies</organization>
      <address>
        <email>yuchaode@huawei.com</email>
      </address>
    </author>

    <date year="2024"/>

    <keyword>Problem</keyword>
    <keyword>Event</keyword>

    <abstract>

      <t>This document sets out some terms that are fundamental to a common understanding
         of network fault and problem management within the IETF.</t>

      <t>The purpose of this document is to bring clarity to discussions and other work
         related to network fault and problem management, in particular to YANG models and management protocols
         that report, make visible, or manage network faults and problems.</t>

    </abstract>

  </front>

  <middle>

    <section anchor="introduction" numbered="true" toc="default">
      <name>Introduction</name>

      <t>Successful operation of large or busy networks depends on network management. Network management comprises a
         virtuous circle of network control, network observability, network analytics, network assurance, and back to
         network control. Network fault and problem management is an important aspect of network management and
         control solutions. It deals with the reporting, inspection, correlation, and management of events within the
         network. The intention is to focus on those events that have a negative effect on the network&apos;s ability to
         forward traffic in an optimal way.  Fault and problem management extends to include actions taken to
         determine the causes of problems and to work toward recovery of optimal network behavior.</t>

      <t>A number of work efforts within the IETF seek to provide components of a fault
         management system, such as YANG models or management protocols. It is important that
         a common terminology is used so that there is a clear understanding of how the
         elements of the management and control solutions fit together, and how faults and problems will be handled.</t>

      <t>This document sets out some terms that are fundamental to a common understanding of network fault and
         problem management.  While "faults" and "problems" are concepts that apply at all levels of technology in
         the Internet, the scope of this document is restricted to the network layer and below, hence this document
         is specifically about "network fault and problem management."</t>

      <t>The terms defined in this document are principally intended for consistent use within the IETF.  Where similar
         concepts are described in other bodies, an attempt has been made to harmonize with those other descriptions, but
         care is needed where terms are not used consistently between bodies or where terms are applied outside the
         network layer.  If other bodies find the terminology defined in this document useful, they are free to use it.</t>

      <t>Note that some useful terms are defined in <xref target="RFC3877" /> and <xref target="RFC8632" />. The
         definitions in this document are informed by those documents, but they are not dependent on that prior
         work.</t>

    </section>

    <section anchor="terms" numbered="true" toc="default">
      <name>Terminology</name>

      <t>The terms are presented below in an order that is intended to allow the reader
         to gain understanding by reading from top to bottom.  The figures and explanations in <xref target="explain" />
         may aid understanding of the terms set out here.</t>

      <dl newline="false" spacing="normal">

         <dt>System:</dt>
           <dd><t>An assembly of components that exhibits some behavior.</t></dd>

         <dt>External System:</dt>
           <dd><t>A system that includes elements that are beyond the scope of the control system.</t></dd>

         <dt>Controlled External System:</dt>
           <dd><t>An external system that is of interest to and is influenced by the control system.
                  It is viewed as a collection of resources.</t></dd>

         <dt>Resource:</dt>
           <dd><t>A component, commodity, service, or capability that can be used to support the delivery
                  of some function.</t>
               <ul>
                 <li>
                   <t>Resource is a recursive concept so that a resource may be a collection of
                      other resources (for example, a network node is a collection of interfaces).</t>
                 </li>
                 <li>
                   <t>Connectivity services and network capabilities may be realized by the collection of many
                      resources, yet services and capabilities may also be recognized as resources
                      in their own right.</t>
                 </li>
               </ul></dd>

         <dt>Characteristic:</dt>
           <dd><t>Observable or measurable aspect or behavior associated with a resource.</t>
               <ul>
                 <li>
                   <t>A characteristic may be considered in dimensional terms: it is
                      built on facts (see 'Value', below) and dimensions (the contexts and
                      descriptors that identify and give meaning to the facts).</t>
                 </li>
                 <li>
                   <t>The term "Metric" is another word for "Characteristic".</t>
                 </li>
               </ul></dd>

         <dt>Value:</dt>
           <dd><t>A measurable amount, which may be in the form of an integer (e.g., a count) or a
                  continuous variable (e.g., an analogue measurement), associated with a characteristic.</t></dd>

         <dt>Condition:</dt>
           <dd><t>The interpretation of the values of a set of characteristics of the resource (with
                  respect to working order or some other aspect relevant to the resource purpose/application).</t></dd>

         <dt>Change:</dt>
           <dd><t>In the context of monitoring network resources, the variation in values associated with a
                  characteristic of a resource at a specific time or over time.</t>
               <ul>
                 <li>
                   <t>Most changes are not noteworthy (i.e., are not relevant).</t>
                 </li>
                 <li>
                   <t>Perception of change depends upon detection, the sampling rate/accuracy/detail, and perspective.</t>
                 </li>
               </ul>
             </dd>

         <dt>Detect:</dt>
           <dd><t>To notice the presence of something (state, change, activity, form, etc.).</t>
               <ul>
                 <li>
                   <t>Hence also to notice a change (from the perspective of the viewer).</t>
                 </li>
               </ul>
             </dd>

         <dt>Event:</dt>
           <dd><t>The change in value (of a characteristic of a resource) at a measurable instant in
                  time (i.e., the period is negligible).</t>
               <ul>
                 <li>
                   <t>Compared with a change, which is over a period of time, an event happens at a
                      measurable instant.</t>
                 </li>
               </ul>
             </dd>

         <dt>State:</dt>
           <dd><t>A particular condition that something (e.g., a resource) is in (at a specific time).</t>
               <ul>
                 <li>
                   <t>While a state may be observed at a specific moment in time, it is actually
                      achieved by summarizing the measurement over time in a process sometimes
                      called state compression.</t>
                 </li>
               </ul>
           </dd>

         <dt>Relevance:</dt>
           <dd><t>Consideration of an event, state, or value (through the application of policy, relative
                  to a specific viewpoint/perspective, intent, and in relation to other events, states,
                  and values) to determine whether it is of note to the control system.</t></dd>

         <dt>Occurrence:</dt>
           <dd><t>A relevant event.</t>
            <t>A particular relevant change.</t>
            <ul>
              <li>
                <t>An occurrence may be an aggregation or abstraction of smaller occurrences.</t>
              </li>
              <li>
                <t>Applies to all scales and scopes, i.e., is essentially fractal (can recurse indefinitely).</t>
              </li>
              <li>
                <t>Note that occurrence is used here with respect to the temporal dimension.</t>
              </li>
            </ul>
          </dd>

         <dt>Fault:</dt>
           <dd><t>An occurrence that is not desired/required (as it may be indicative of a current or future
                  undesired State). A fault can generally be associated with a known cause. See
                  <xref target="RFC8632" /> for a more detailed discussion of network faults.</t></dd>

         <dt>Problem:</dt>
           <dd><t>A state that is regarded as undesirable and that may require remedial action. A problem cannot
                  necessarily be associated with a cause. The resolution of a problem does not necessarily
                  act on the thing that has the problem.</t>
               <ul>
                 <li>
                   <t>Note that there is a historic aspect to the concept of a problem. The current state
                      may be operational, but there could have been a failure that is unexplained, and
                      the fact of that unexplained recent failure is a problem.</t>
                 </li>
                 <li>
                   <t>Note that whilst a problem is unresolved it may continue to require attention. A
                      record of resolved problems may be maintained in a log.</t>
                 </li>
                 <li>
                   <t>Note that there may be a state which is considered to be a problem from several
                      perspectives (e.g., a loss of light state may cause multiple services to fail).
                      A state change (so that the light recovers) may cause the problem to be resolved
                      from one perspective (the services are operational once more), but may leave the
                      problem as unresolved (because the loss of light has not been explained). There
                      could be a further development (the reason for the temporary loss of light is
                      traced to a microbend in the fiber that is repaired), with the result that the
                      unresolved problem is now resolved. But this leaves a further problem still
                      unresolved (why did the microbend occur in the first place?).</t>
                 </li>
               </ul>
             </dd>

         <dt>Incident:</dt>
           <dd><t>A network incident is an undesired occurrence such as an unexpected interruption of a
                  network service, degradation of the quality of a network service, or the below-target
                  health of a network service. An incident results from one or more problems, and a
                  problem may give rise to or contribute to one or more incidents. Greater
                  discussion of network incidents, including incident management, can be found in
                  <xref target="I-D.ietf-nmop-network-incident-yang" />.</t></dd>

         <dt>Anomaly:</dt>
           <dd><t>A (network) anomaly is an unusual or unexpected event or pattern in network data in the
                  forwarding plane, control plane, or management plane that deviates from the normal,
                  expected behavior. See <xref target="I-D.ietf-nmop-network-anomaly-architecture" />
                  for more details.</t></dd>

         <dt>Symptom:</dt>
           <dd><t>An observable characteristic/state/condition considered as an indication of a
                  problem or potential problem.</t></dd>

         <dt>Cause:</dt>
           <dd><t>The events (detected or otherwise) that gave rise to a fault/problem.</t></dd>

         <dt>Consolidation:</dt>
           <dd><t>The process of considering multiple problems, symptoms, and their causes to
                  determine the underlying causes.</t></dd>

         <dt>Alert:</dt>
           <dd><t>The indication of a fault.</t></dd>

         <dt>Alarm:</dt>
           <dd><t>Per <xref target="RFC8632" />, an alarm signifies an undesirable state in a
                  resource that requires corrective action.  From a management point of view,
                  an alarm can be considered as a state in its own right; the transition to this state
                  is a fault and may result in an alert being issued.  The receipt of this alert
                  may give rise to a continuous indication (to a human operator) highlighting the
                  potential or actual presence of a problem.</t></dd>

      </dl>

      <t>Two other terms may be helpful:</t>

      <dl newline="false" spacing="normal">

         <dt>Transient:</dt>
           <dd><t>A state, considered as a problem, that persists for a limited amount of time
                  before becoming resolved without direct action by an operator or control
                  system.</t></dd>
         <dt>Intermittent:</dt>
           <dd><t>A state that is not maintained, but keeps occurring in some meaningfully
                  short time frame.</t></dd>

      </dl>

    </section>

    <section anchor="explain" numbered="true" toc="default">
      <name>Workflow Explanations</name>

      <t>The relationship between system, resource, and characteristics is shown in
         <xref target="systemfig" />. A Controlled External System is composed of
         Resources, and Resources have Characteristics.</t>

        <figure anchor="systemfig">
          <name>Relationship Between Elements of a System</name>
          <artwork align="center" name="" type="" alt="">
            <![CDATA[
                Characteristics
                       ^
                       |
                    Resource
                       ^
                       |
           Controlled External System
                       ^
                       |
                External System
            ]]>
          </artwork>
        </figure>
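
     <t>As an informal illustration only, the containment shown in <xref target="systemfig" />
        could be sketched as a small data model. The Python class and field names used below
        (Resource, Characteristic, children, walk) are assumptions made for this example and
        are not defined by this document.</t>

        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Characteristic:
    """Observable or measurable aspect of a resource."""
    name: str
    value: float  # current Value of the Characteristic


@dataclass
class Resource:
    """A resource; recursive, so it may contain other resources."""
    name: str
    characteristics: Dict[str, Characteristic] = field(
        default_factory=dict)
    children: List["Resource"] = field(default_factory=list)

    def walk(self) -> Iterator["Resource"]:
        """Yield this resource and then all nested resources."""
        yield self
        for child in self.children:
            yield from child.walk()


# Hypothetical example: a node viewed as a collection of interfaces.
eth0 = Resource(
    "eth0", {"rx_errors": Characteristic("rx_errors", 0)})
eth1 = Resource(
    "eth1", {"rx_errors": Characteristic("rx_errors", 3)})
node = Resource("node-a", children=[eth0, eth1])
names = [r.name for r in node.walk()]
]]></sourcecode>

     <t>In this sketch, the recursive nature of a Resource is captured by the children list, so a
        node, its interfaces, and any finer-grained components can all be visited in the same way.</t>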

     <t>The Value of a Characteristic of a Resource is expected to change over time. Specific
        changes in value may be noticed at a specific time (as digital changes), Detected, and
        treated as Events. This is shown on the left of <xref target="characterfig" />.</t>

     <t>The center of <xref target="characterfig" /> shows how the Value of a Characteristic
        may change over time. The value may be Detected at specific times or periodically
        and give rise to States (and consequently State changes).</t>

     <t>In practice, the Characteristic may vary in an analogue manner over time as shown on the
        right-hand side of <xref target="characterfig" />. The Value can be read or reported
        (i.e., Detected) periodically, leading to Analogue Values that may be deemed Relevant
        Values, or may be evaluated over time as shown in <xref target="thresholdfig" />.</t>

        <figure anchor="characterfig">
          <name>Characteristics and Changes</name>
          <artwork align="center" name="" type="" alt="">
            <![CDATA[
      Event                State                  Value

        ^                    ^                      ^
 Detect :             Detect :               Detect :
        :                    :                      :

   ^        ^          ^     ^     ^                   /\
   :        :          :     :     :                  /  \
   :        :          :     :     :             /\  /    \
    __    __               _____                /  \/
   |        |             |     |            /\/
 __|        |__       ____|     |____       /

Change at a time     Change over time      Change over time
            ]]>
          </artwork>
        </figure>
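
     <t>The left-hand side and center of <xref target="characterfig" /> might be sketched as
        follows: successive Detected Values are compared, and each difference is recorded as an
        Event at the time it was noticed. The function name detect_events and the (time, value)
        sample format are assumptions made for this illustration.</t>

        <sourcecode type="python"><![CDATA[
from typing import List, Tuple

Sample = Tuple[float, int]        # (detection time, Value)
Event = Tuple[float, int, int]    # (time, old Value, new Value)


def detect_events(samples: List[Sample]) -> List[Event]:
    """Return one Event per change between consecutive samples."""
    events: List[Event] = []
    for (_, old), (t, new) in zip(samples, samples[1:]):
        if new != old:
            events.append((t, old, new))
    return events


# Hypothetical link-state readings: 1 = up, 0 = down.
readings = [(0.0, 1), (1.0, 1), (2.0, 0), (3.0, 0), (4.0, 1)]
events = detect_events(readings)
]]></sourcecode>

     <t>A change that occurs and reverts between two samples is not seen by this sketch, which
        illustrates how the perception of change depends on the sampling rate.</t>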

     <t><xref target="eventfig" /> shows the workflow progress for Events. As noted above, an
        Event is a Change in the Value of a Characteristic at a time. The Event may be
        evaluated (considering policy, relative to a specific viewpoint/perspective, with a
        view to intent, and in relation to other Events, States, and Values) to determine if
        it is an Occurrence and possibly to indicate a change of State. An Occurrence may be
        undesirable (a Fault). A Fault can cause an Alert to be generated, may be evidence
        of a Problem, and could directly indicate a Cause.</t>

        <figure anchor="eventfig">
          <name>Events and Dependent Terms</name>
          <artwork align="center" name="" type="" alt="">
            <![CDATA[

        Alert- - - - > Alarm
          ^
          |
          |     -----> Cause
          |    |
          |----------> Problem
          |
          |
        Fault
          ^
          |
          |
          |
     Occurrence
          ^
          |
          |----------> State
          |
          |
        Event
            ]]>
          </artwork>
        </figure>
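
     <t>The path from Event to Alert in <xref target="eventfig" /> could be sketched as a
        two-step evaluation: Relevance decides whether an Event is an Occurrence, and a second
        check decides whether the Occurrence is a Fault. Both policy functions below
        (is_relevant and is_undesired), the Event fields, and the alert string format are
        hypothetical; real policies are specific to an operator or implementation.</t>

        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    resource: str
    characteristic: str
    old: int
    new: int


# Both policies are hypothetical examples.
def is_relevant(ev: Event) -> bool:
    """Relevance: only operational-state changes are of note."""
    return ev.characteristic == "oper_state"


def is_undesired(ev: Event) -> bool:
    """An Occurrence is a Fault if the state went from up to down."""
    return ev.old == 1 and ev.new == 0


def evaluate(events: List[Event]) -> List[str]:
    """Return one Alert string per Fault found in the Events."""
    alerts = []
    for ev in events:
        if not is_relevant(ev):
            continue              # an Event, but not an Occurrence
        if is_undesired(ev):      # an Occurrence that is a Fault
            alerts.append(f"ALERT {ev.resource} {ev.characteristic}")
    return alerts


alerts = evaluate([
    Event("eth0", "oper_state", 1, 0),
    Event("eth0", "rx_packets", 10, 11),
    Event("eth1", "oper_state", 0, 1),
])
]]></sourcecode>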


     <t>Parallel to the workflow for Events, <xref target="statefig" /> shows the
        workflow progress for States. As shown in <xref target="characterfig" />,
        Change noted at a particular time gives rise to State. The State may be
        deemed relevant (via Relevance) considering policy, relative to a specific viewpoint/perspective,
        with a view to intent, and in relation to other Events, States, and Values.
        A Relevant State may be deemed a Problem, or may indicate a Problem.</t>

     <t>Problems may be considered as Symptoms and may map directly or indirectly
        to Causes. An Alarm may be raised as the result of a Problem. An Incident
        results from one or more Problems.</t>

        <figure anchor="statefig">
          <name>States and Dependent Terms</name>
          <artwork align="center" name="" type="" alt="">
            <![CDATA[
        Alarm
          ^
          |     ------> Incident
          |    |
          |    |   ---> Cause
          |    |  |
      Problem---------> Symptom
          ^
          |
          |
          |
    Relevant State
          ^
          |
          |
          |
        State
            ]]>
          </artwork>
        </figure>

     <t><xref target="consolidationfig" /> shows how Faults and Problems
        may be consolidated to determine the Causes.</t>

     <t>A Cause can be indicated by or determined from Faults, Problems and Symptoms.
        It may be that one Cause points to another, and can also be considered as a
        Symptom. The determination of Causes can consider multiple inputs. An Incident
        results from one or more Problems.</t>

        <figure anchor="consolidationfig">
          <name>Consolidation of Symptoms and Causes</name>
          <artwork align="center" name="" type="" alt="">
            <![CDATA[
                                      ---------
                       ------------- |         |
                      |  ----------> | Symptom |
                      | |            |         |
                      | |             ---------
                      v |                 ^
                   ---------              |
          ------->|  Cause  |<---------   |
         |         ---------           |  |
         |           ^   |             |  |
         |           |   |             |  |
         |            ---              |  |
         |                             |  |
     ---------                      ---------          ----------
    |  Fault  |------------------->| Problem |------->| Incident |
     ---------                      ---------          ----------
            ]]>
          </artwork>
        </figure>
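
     <t>Because one Cause may point to another, Consolidation can be thought of as following
        links until no further Cause is indicated. The sketch below assumes a hypothetical
        mapping (named indicates) from each Fault, Problem, Symptom, or Cause to the Causes it
        points to; the item names are invented for illustration and reuse the loss-of-light
        example from the definition of Problem.</t>

        <sourcecode type="python"><![CDATA[
from typing import Dict, List, Set

# Hypothetical mapping: each item points to the Causes it indicates.
indicates: Dict[str, List[str]] = {
    "service-down": ["loss-of-light"],
    "high-ber": ["loss-of-light"],
    "loss-of-light": ["fiber-microbend"],
    "fiber-microbend": [],
}


def underlying_causes(items: List[str]) -> Set[str]:
    """Follow the links; return items that indicate nothing more."""
    roots: Set[str] = set()
    seen: Set[str] = set()
    stack = list(items)
    while stack:
        item = stack.pop()
        if item in seen:
            continue              # guards against cyclic links
        seen.add(item)
        further = indicates.get(item, [])
        if further:
            stack.extend(further)
        else:
            roots.add(item)
    return roots


roots = underlying_causes(["service-down", "high-ber"])
]]></sourcecode>

     <t>Here two separately observed items are consolidated to a single underlying Cause. As the
        definition of Problem notes, that Cause may itself leave a further question unresolved.</t>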

     <t>The final figure in this section (<xref target="thresholdfig" />) shows
        how thresholds are important in the consideration of Analogue Values and
        Events. The use of threshold-driven events and states (and the alerts that
        they might give rise to) must be treated with caution to dampen any "flapping"
        (so that consistent states may be observed) and to avoid overwhelming management
        processes or systems. Analogue Values may be read or notified from the Resource
        and could cross a threshold, be deemed Relevant Values, or be evaluated over
        time. Events may be counted, and the Count may cross a threshold or
        reach a Relevant Value.</t>

     <t>The Threshold Process may be implementation-specific and subject to policies.
        When a threshold is crossed and any other conditions are matched, an Event
        may be determined, and treated like any other Event.</t>

        <figure anchor="thresholdfig">
          <name>Counts, Thresholds, and Values</name>
          <artwork align="center" name="" type="" alt="">
            <![CDATA[
Occurrence
     ^
     |
     |---------------------> State
     |
     |        -------
     |------>| Count |-------------------------> Relevant Value
     |        -------          |                       ^
     |           |             |                       |
     |           |             |                       |
     |           |             v                       |
     |           |        -----------           ----------------
   Event         |       | Evaluated |         |                |
     ^           |       | over time |<--------| Analogue Value |
     |           v        -----------          |                |
     |      -----------        |               |                |
     |     | Threshold |       |               |                |
     |<----|  Process  |<------                |                |
     |     |           |<----------------------|                |
     |      -----------                         ----------------
     |                                                 ^
     |                                                 |
     | Detect                                   Detect |
     |                                                 |
Change at a Time                                Change over Time
            ]]>
          </artwork>
        </figure>
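
     <t>One common way to dampen flapping in a Threshold Process is hysteresis, that is, using
        separate levels for raising and clearing a state. This document does not prescribe
        any particular technique, so the sketch below is only an assumption about how such a
        process might look; the function name threshold_events and the parameters raise_at and
        clear_at are invented for this example.</t>

        <sourcecode type="python"><![CDATA[
from typing import List, Tuple


def threshold_events(values: List[float], raise_at: float,
                     clear_at: float) -> List[Tuple[int, str]]:
    """Emit (index, "raise" or "clear") Events with hysteresis.

    The state is raised only at or above raise_at and cleared only
    at or below clear_at, so values that hover between the two
    levels do not generate a stream of Events.
    """
    if clear_at > raise_at:
        raise ValueError("clear_at must not exceed raise_at")
    raised = False
    events: List[Tuple[int, str]] = []
    for i, v in enumerate(values):
        if not raised and v >= raise_at:
            raised = True
            events.append((i, "raise"))
        elif raised and v <= clear_at:
            raised = False
            events.append((i, "clear"))
    return events


# Hypothetical utilization samples (percent).
samples = [70, 91, 88, 92, 89, 79, 85]
events = threshold_events(samples, raise_at=90, clear_at=80)
]]></sourcecode>

     <t>With a single threshold at 90, the same samples would cross the threshold four times;
        with separate raise and clear levels, only two Events are produced.</t>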

    </section>

    <section anchor="security-considerations" numbered="true" toc="default">
      <name>Security Considerations</name>

      <t>This document specifies terminology and has no direct effect on the security of
         implementations or deployments. However, protocol solutions and management models
         need to be aware of several aspects:</t>

      <ul>
        <li>
          <t>The exposure of information pertaining to faults may make available knowledge
             of the internal workings of a network (in particular its vulnerabilities) that
             may be of use to an attacker.</t>
        </li>
        <li>
          <t>Systems that generate management information (messages, notifications, etc.) when
             faults occur may be attacked by causing them to generate so much information
             that the management system is swamped and unable to properly manage the network.</t>
        </li>
        <li>
          <t>Reporting false information about faults (or masking reports of faults) may
             cause the management system to function incorrectly.</t>
        </li>
      </ul>

    </section>

    <section anchor="privacy-considerations" numbered="true" toc="default">
      <name>Privacy Considerations</name>

      <t>In general, Fault Management should not expose information about end-user activities
         or user data. The main privacy concern is for a network operator to keep control of
         all information about faults to protect their privacy and the details of how they
         operate their network.</t>

    </section>

    <section anchor="iana-considerations" numbered="true" toc="default">
      <name>IANA Considerations</name>

      <t>This document makes no requests for IANA action.</t>

    </section>

    <section anchor="acknowledgments" numbered="false" toc="default">
      <name>Acknowledgments</name>

      <t>The authors would like to thank Med Boucadair, Wanting Du, Joe Clarke, Javier Antich, and Benoit Claise for their helpful comments.</t>

      <t>Special thanks to the team that met at a side meeting at IETF-120 to discuss some of the thorny issues:</t>
      <ul>
         <li>Benoit Claise</li>
         <li>Watson Ladd</li>
         <li>Brad Peters</li>
         <li>Bo Wu</li>
         <li>Georgios Karagiannis</li>
         <li>Olga Havel</li>
         <li>Vincenzo Riccobene</li>
         <li>Yi Lin</li>
         <li>Jie Dong</li>
         <li>Aihua Guo</li>
         <li>Thomas Graf</li>
         <li>Qin Wu</li>
         <li>Chaode Yu</li>
         <li>Adrian Farrel</li>
      </ul>

    </section>

<!--
    <section anchor="contributors" numbered="false" toc="default">
      <name>Contributors</name>

      <t>The following authors contributed significantly to this document:</t>
        <artwork name="" type="" align="left" alt="">
          <![CDATA[

          ]]>
       </artwork>

    </section>
 -->

  </middle>

  <back>

<!--
    <references>
      <name>Normative References</name>
    </references>

-->

    <references>
      <name>Informative References</name>

      &RFC3877;
      &RFC8632;

      &I-D.ietf-nmop-network-anomaly-architecture;
      &I-D.ietf-nmop-network-incident-yang;

    </references>

  </back>

</rfc>
