<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?xml-stylesheet type='text/xsl' href='rfc7749.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
    please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
    (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->

<!-- control vertical white space 
    (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc
    xmlns:xi="http://www.w3.org/2001/XInclude"
    category="std"
    docName="draft-ietf-bess-evpn-mh-pa-06"
    consensus="true"
    submissionType="IETF"
    ipr="trust200902"
    tocInclude="true"
    tocDepth="4"
    symRefs="true"
    sortRefs="true">

<!-- ***** FRONT MATTER ***** -->
 <front>
    <!-- The abbreviated title is used in the page header - it is only necessary if the 
        full title is longer than 39 characters -->

   <title abbrev="EVPN MH Port-Active">EVPN multi-homing port-active load-balancing</title>
    <seriesInfo name="Internet-Draft" value="draft-ietf-bess-evpn-mh-pa-06"/>

   <author fullname="Patrice Brissette" initials="P." role="editor" surname="Brissette">
      <organization>Cisco Systems</organization>
      <address>
        <postal>
          <street/>
          <city>Ottawa</city>
          <region>ON</region>
          <country>Canada</country>
        </postal>
        <phone/>
        <email>pbrisset@cisco.com</email>
      </address>
    </author>
    <author fullname="Ali Sajassi" initials="A." surname="Sajassi">
      <organization>Cisco Systems</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <region/>
          <code/>
          <country>USA</country>
        </postal>
        <email>sajassi@cisco.com</email>
      </address>
    </author>
    <author fullname="Luc Andre Burdet" initials="LA." role="editor" surname="Burdet">
      <organization>Cisco Systems</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <region/>
          <code/>
          <country>Canada</country>
        </postal>
        <email>lburdet@cisco.com</email>
      </address>
    </author>
    <author fullname="Samir Thoria" initials="S." surname="Thoria">
      <organization>Cisco Systems</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <region/>
          <code/>
          <country>USA</country>
        </postal>
        <email>sthoria@cisco.com</email>
      </address>
    </author>
    <author fullname="Bin Wen" initials="B." surname="Wen">
      <organization>Comcast</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <region/>
          <code/>
          <country>USA</country>
        </postal>
        <email>Bin_Wen@comcast.com</email>
      </address>
    </author>
    <author fullname="Edward Leyton" initials="E." surname="Leyton">
      <organization>Verizon Wireless</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <region/>
          <code/>
          <country>USA</country>
        </postal>
        <email>edward.leyton@verizonwireless.com</email>
      </address>
    </author>
    <author fullname="Jorge Rabadan" initials="J." surname="Rabadan">
      <organization>Nokia</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <region/>
          <code/>
          <country>USA</country>
        </postal>
        <email>jorge.rabadan@nokia.com</email>
      </address>
    </author>

    <date year="2022"/>

   <!-- Meta-data Declarations -->
   <area>General</area>
    <workgroup>BESS Working Group</workgroup>
    <keyword>Port-Active</keyword>
    <keyword>EVPN</keyword>
    <keyword>Multi-homing</keyword>

   <abstract>
      <t>The Multi-Chassis Link Aggregation Group (MC-LAG) technology enables
   establishing a logical link-aggregation connection with a
   redundant group of independent nodes. The purpose of multi-chassis
   LAG is to achieve higher network availability while providing
   different modes of sharing/balancing traffic.
   RFC7432 defines EVPN-based MC-LAG with single-active and all-active
   multi‑homing load‑balancing modes. This document expands on the
   existing redundancy mechanisms supported by EVPN and introduces
   support for the port-active load‑balancing mode.</t>
    </abstract>
  </front>
  <middle>
    <section>
      <name>Introduction</name>
      <t>EVPN, as per <xref target="RFC7432"/>, provides all-active per-flow load‑balancing
     for multi‑homing. It also defines a single-active mode with service carving,
     where one of the PEs in the redundancy relationship is active per
     service.</t>
      <t>While these two multi‑homing scenarios are the most widely utilized in
     data center and service provider access networks, there are scenarios
     where active-standby per-interface multi‑homing load‑balancing is useful
     and required. The main consideration for this mode of load‑balancing is
     the determinism of traffic forwarding through a specific interface
     rather than statistical per-flow load‑balancing across multiple PEs
     providing multi‑homing. The determinism provided by active-standby
     per interface is also required for certain QoS features to work.
     While using this mode, customers also expect minimal convergence time
     during failures.</t>
      <t>A new type of load‑balancing mode, port-active load‑balancing, is defined.
     This document describes how the new load‑balancing mode can be supported
     via EVPN. The new mode may also be referred to as per-interface
   active/standby.</t>
      <figure anchor="Topology">
        <name>MC-LAG Topology</name>
        <artwork align="left"><![CDATA[

                 +-----+
                 | PE3 |
                 +-----+
              +-----------+
              |  MPLS/IP  |
              |  CORE     |
              +-----------+
            +-----+   +-----+
            | PE1 |   | PE2 |
            +-----+   +-----+
               |         |
               I1       I2 
                 \     /
                  \   /
                  +---+
                  |CE1|
                  +---+
       ]]></artwork>
      </figure>
      <t><xref target="Topology"/> shows an MC-LAG multi‑homing topology where PE1 and PE2 are
     part of the same redundancy group providing multi‑homing to CE1 via
     interfaces I1 and I2. Interfaces I1 and I2 are members of a LAG running
     LACP. The core, shown as IP or MPLS
     enabled, provides a wide range of L2 and L3 services. MC-LAG multi‑homing
     functionality is decoupled from those services in the core and
     focuses on providing multi‑homing to the CE. With per-port
     active/standby load‑balancing, only one of the two interfaces, I1 or I2,
     is in forwarding state; the other interface is in standby. This
     also implies that all services on the active interface are in active
     mode and all services on the standby interface operate in standby
     mode.</t>
      <section anchor="requirements">
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
          "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT
          RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
          interpreted as described in BCP 14 <xref target="RFC2119"/>
          <xref target="RFC8174"/> when, and only when, they appear in
          all capitals, as shown here.</t>
      </section>
    </section>

    <section>
      <name>Multi-Chassis Link Aggregation</name>
      <t>When a CE is multi‑homed to a set of PE nodes using the <xref target="IEEE.802.1AX_2014"/>
   Link Aggregation Control Protocol (LACP), the PEs must act as if they
   were a single LACP speaker for the Ethernet links to form and operate as a Link Aggregation Group (LAG). To achieve this, the
   PEs connected to the same multi‑homed CE must synchronize LACP
   configuration and operational data among them. The Inter-Chassis
   Communication Protocol (ICCP) <xref target="RFC7275"/> has been used for that purpose.
   EVPN LAG greatly simplifies that solution. Along with the
   simplification come a few assumptions:</t>
      <ul spacing="normal">
        <li>A CE device connected to multi‑homing PEs may have a single LAG
     containing all its active links, i.e., the links in the LAG operate in
     all-active load‑balancing mode.</li>
        <li>The same LACP parameters, such as system ID, port priority, and port key,
     MUST be configured on peering PEs.</li>
      </ul>
      <t>Any discrepancies from this list are out of the scope of this document, as are misconfiguration and miswiring detection across
   peering PEs.</t>
    </section>

    <section>
      <name>Port-active Load-balancing Procedure</name>
      <t>The following steps describe the proposed procedure with EVPN LAG to
   support port-active load‑balancing mode:</t>

      <ol spacing="normal" type="a">
        <li>The Ethernet-Segment Identifier (ESI) MUST be assigned per access
        interface as described in <xref target="RFC7432"/>; it may be auto-derived or
        manually assigned. The access interface MAY be a Layer‑2 or Layer‑3
        interface. The usage of an ESI over a Layer‑3 interface is newly described in
        this document.</li>

        <li>The Ethernet-Segment (ES) MUST be configured in port-active load‑balancing
        mode on peering PEs for the specific access interface.</li>

        <li>Peering PEs MAY exchange only Ethernet-Segment (ES) routes (Route Type‑4)
        when the ESI is configured on a Layer‑3 interface.</li>

        <li>PEs in the redundancy group leverage the DF election defined in
        <xref target="RFC8584"/> to determine which PE keeps the port in active mode and
        which one(s) keep it in standby mode.  While the DF election defined
        in <xref target="RFC8584"/> is per [ES, Ethernet Tag] granularity, for port-active
         mode of multi‑homing, the DF election is done per &lt;ES&gt;.  The details
        of this algorithm are described in <xref target="df_algo"/>. </li>
        
        <li>The DF router MUST keep the corresponding access interface in up and
        forwarding active state for that Ethernet-Segment.</li>
        
        <li>Non-DF routers will by default implement a bidirectional blocking scheme for all
        traffic in line with the <xref target="RFC7432"/> Single-Active blocking scheme, albeit across all VLANs.
            <ul>
            <li>Non-DF routers MAY bring the peering access interface attached to them to the
            operational down state and keep it there.</li>
            <li>If the interface is running LACP, then the non-DF PE MAY set the LACP state to OOS
            (Out of Sync) instead of bringing the interface down. This allows for
            better convergence on standby to active transition.</li>
            </ul>
        </li>
        
        <li>For EVPN-VPWS service, the use of the primary/backup bits of the EVPN
     Layer‑2 Attributes extended community <xref target="RFC8214"/> is highly recommended
     to achieve better convergence.</li>
      </ol>
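      <t>As a non-normative illustration, the interface handling in steps e and f above
      can be sketched in Python as follows. The names PortAction and port_action, and the
      two local-policy knobs lacp_oos and oper_down, are hypothetical and are not defined
      by this document; the precedence between the two knobs is likewise an assumption of
      the sketch.</t>
      <sourcecode type="python"><![CDATA[
from enum import Enum


class PortAction(Enum):
    FORWARD = "up and forwarding"
    BLOCK = "bidirectional blocking"
    OPER_DOWN = "operationally down"
    LACP_OOS = "LACP out of sync"


def port_action(is_df: bool, lacp_oos: bool = False,
                oper_down: bool = False) -> PortAction:
    """Pick the access-interface action for one Ethernet Segment."""
    if is_df:
        # Step e: the DF keeps the interface up and forwarding.
        return PortAction.FORWARD
    if lacp_oos:
        # Step f, LACP option: signal Out of Sync instead of link down.
        return PortAction.LACP_OOS
    if oper_down:
        # Step f, first option: hold the interface operationally down.
        return PortAction.OPER_DOWN
    # Step f default: block all traffic in both directions, all VLANs.
    return PortAction.BLOCK
]]></sourcecode>
      <t>For example, port_action(is_df=False, lacp_oos=True) yields PortAction.LACP_OOS,
      while a non-DF PE with neither knob set falls back to bidirectional blocking.</t>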
    </section>

    <section anchor="df_algo">
      <name>Designated Forwarder Algorithm to Elect per Port-active PE</name>
      <t>The ES routes, running in port-active load‑balancing mode, are
      advertised with the new Port Mode Load-Balancing capability in the DF Election Extended
      Community defined in <xref target="RFC8584"/>. Moreover, the ES associated with the
      port leverages the existing Single-Active procedures, and signals the
      Single-Active multihomed site redundancy mode along with the Ethernet A-D per ES route
      (<relref target="RFC7432" section="7.5"/>). Finally, the ESI-label-based
      split-horizon procedures in <relref target="RFC7432" section="8.3"/> should be used
      to avoid transient echoed packets when Layer‑2 circuits are involved.</t>

      <t>The various algorithms for DF Election are discussed in Sections <xref target="modulo"
      format="counter"/> to <xref target="ac_df" format="counter"/> for completeness, although the choice of algorithm
      in this solution doesn't affect complexity or performance as in other load-balancing modes.</t>

      <section anchor="cap_flag">
        <name>Capability Flag</name>
        <t>
        <xref target="RFC8584"/> defines a DF Election extended community, and a Bitmap
       field to encode "capabilities" to use with the DF election algorithm
       in the DF algorithm field. Bitmap (2 octets) is extended by the
     following value:</t>

<figure anchor="Bitmap">
          <name>Amended Bitmap field in the DF Election Extended Community</name>
          <artwork align="left"><![CDATA[

                         1 1 1 1 1 1
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |D|A|     |P|                   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       ]]></artwork>
        </figure>

    <dl newline="false" spacing="normal" indent="10">
          <dt>Bit 0:</dt>   <dd>D bit or 'Don't Preempt' bit, as explained in <xref target="I-D.ietf-bess-evpn-pref-df"/>.</dd>
          <dt>Bit 1:</dt>   <dd>AC-DF Capability (AC-Influenced DF election), as explained in <xref target="RFC8584"/>.</dd>
          <dt>Bit 5:</dt>   <dd>'Port Mode Load-Balancing' Capability (P bit hereafter).
             This bit corresponds to Bit 29 of the DF Election Extended
             Community and is defined by this document. It indicates
             that the DF algorithm should be modified to consider
             the port ES only and not the Ethernet Tags.</dd>
        </dl>
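        <t>The bit positions in <xref target="Bitmap"/> can be illustrated with a short,
        non-normative Python sketch. It assumes that bits are numbered from the most
        significant bit of the 2-octet Bitmap, as drawn in the figure, and that the field
        is carried in network byte order. The helper names (bit_mask, encode_bitmap,
        has_port_mode) are hypothetical.</t>
        <sourcecode type="python"><![CDATA[
BITMAP_WIDTH = 16  # the Bitmap field is 2 octets wide


def bit_mask(position: int) -> int:
    """Mask for a bit numbered from the most significant bit (bit 0)."""
    return 1 << (BITMAP_WIDTH - 1 - position)


D_BIT = bit_mask(0)  # Don't Preempt
A_BIT = bit_mask(1)  # AC-Influenced DF election
P_BIT = bit_mask(5)  # Port Mode Load-Balancing


def encode_bitmap(d: bool = False, a: bool = False,
                  p: bool = False) -> bytes:
    """Encode the selected capabilities as a 2-octet Bitmap."""
    value = (D_BIT if d else 0) | (A_BIT if a else 0) | (P_BIT if p else 0)
    return value.to_bytes(2, "big")


def has_port_mode(bitmap: bytes) -> bool:
    """True when the P bit is set in a received 2-octet Bitmap."""
    return bool(int.from_bytes(bitmap, "big") & P_BIT)
]]></sourcecode>
        <t>Under these assumptions, a Bitmap carrying only the P bit encodes as the two
        octets 0x04 0x00.</t>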
      </section>
      <section anchor="modulo">
        <name>Modulo-based Algorithm</name>
        <t>The default DF Election algorithm, or modulus-based algorithm as in
      <xref target="RFC7432"/> and updated by <xref target="RFC8584"/>, is used here, at the granularity
     of ES only. Given that the ES-Import Route Target extended community may be auto-derived and
     directly inherits its auto-derived value from ESI bytes 1-6, many operators differentiate ESIs
     primarily within these bytes.
     As a result, bytes 3‑6 are used to determine the designated forwarder using Modulo-based DF
     assignment, achieving good entropy during Modulo calculation across ESIs:<br/>
     Assuming a redundancy group of N PE nodes, the PE with ordinal i is the DF for an &lt;ES&gt;
     when (Es mod N) = i, where Es represents bytes 3‑6 of that ESI.</t>
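        <t>The computation above can be sketched, non-normatively, in Python. The sketch
        assumes that ESI bytes are numbered from zero, so that byte 0 is the ESI Type and
        bytes 3-6 are the fourth through seventh octets of the 10-octet ESI, and that
        ordinals are assigned in ascending order of PE IP address as in the default
        procedure of <xref target="RFC7432"/>. The function name modulo_df is
        hypothetical.</t>
        <sourcecode type="python"><![CDATA[
import ipaddress
from typing import List


def modulo_df(esi: bytes, pe_addresses: List[str]) -> str:
    """Return the address of the DF for one ES (per ES, no Ethernet Tag)."""
    if len(esi) != 10:
        raise ValueError("an ESI is 10 octets long")
    # Ordinals follow ascending numeric order of the PE IP addresses.
    ordered = sorted(pe_addresses,
                     key=lambda a: int(ipaddress.ip_address(a)))
    # Es: bytes 3..6 of the ESI (assumed zero-based, byte 0 = ESI Type).
    es = int.from_bytes(esi[3:7], "big")
    return ordered[es % len(ordered)]
]]></sourcecode>
        <t>For instance, with an ESI whose bytes 3-6 are 0x00000005 and the two PEs
        192.0.2.1 and 192.0.2.2, Es mod 2 is 1, so the PE with ordinal 1 (192.0.2.2)
        is the DF.</t>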
      </section>
      <section anchor="hrw">
        <name>HRW Algorithm</name>
        <t>
       The Highest Random Weight (HRW) algorithm defined in <xref target="RFC8584"/> MAY also
       be used and signaled, and modified to operate at the granularity of
       &lt;ES&gt; rather than per &lt;ES, VLAN&gt;.</t>
       
       <t><relref target="RFC8584" section="3.2"/> describes computing a 32-bit CRC over the concatenation of
       Ethernet Tag and ESI. For port-active load‑balancing mode, the
       Ethernet Tag is simply removed from the CRC computation.</t>

       <t>DF(Es) denotes the DF and BDF(Es) denotes the BDF for the ESI Es; Si is the IP address of PE i; and Weight is a function of Si and Es.</t>
       <ol>
           <li>DF(Es) = Si| Weight(Es, Si) >= Weight(Es, Sj), for all j.
           In the case of a tie, choose the PE whose IP address is
           numerically the least.  Note that 0 &lt;= i,j &lt; number of PEs in the
           redundancy group.</li>

            <li>BDF(Es) = Sk| Weight(Es, Si) &gt;= Weight(Es, Sk), and
            Weight(Es, Sk) &gt;= Weight(Es, Sj).  In the case of a tie,
            choose the PE whose IP address is numerically the least.</li>
       </ol>
       <t>Where:</t>
       <ul>
           <li>DF(Es) is defined to be the address Si (index i) for which
           Weight(Es, Si) is the highest; 0 &lt;= i &lt;= N-1.</li>

           <li>BDF(Es) is defined as that PE with address Sk for which the
           computed Weight is the next highest after the Weight of the DF.
           j is the running index from 0 to N-1; i and k are selected values.</li>
       </ul>
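       <t>The selection rules above can be sketched, non-normatively, in Python. Note that
       the weight function below is a stand-in chosen for brevity (a CRC-32 over the ESI
       followed by the PE address, via zlib.crc32); it is not the Weight function of
       <xref target="RFC8584"/>, and only the ranking and tie-breaking logic is meant to be
       illustrative. The names weight and hrw_df_bdf are hypothetical.</t>
       <sourcecode type="python"><![CDATA[
import ipaddress
import zlib
from typing import List, Optional, Tuple


def weight(esi: bytes, address: str) -> int:
    """Stand-in weight: CRC-32 over the ESI followed by the PE address."""
    packed = ipaddress.ip_address(address).packed
    return zlib.crc32(esi + packed)


def hrw_df_bdf(esi: bytes,
               pe_addresses: List[str]) -> Tuple[str, Optional[str]]:
    """Return (DF, BDF) for one ES; BDF is None with a single PE."""
    # Highest weight first; ties go to the numerically lowest IP address.
    ranked = sorted(
        pe_addresses,
        key=lambda a: (-weight(esi, a), int(ipaddress.ip_address(a))),
    )
    bdf = ranked[1] if len(ranked) > 1 else None
    return ranked[0], bdf
]]></sourcecode>
       <t>Because the Ethernet Tag is not an input, every service on the port maps to the
       same DF and BDF.</t>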
      </section>
      <section anchor="pref_df">
        <name>Preference-based DF Election</name>
        <t>When the new capability 'Port-Mode' is signaled, the algorithm is
       modified to consider the port only and not any associated Ethernet
       Tags. Furthermore, the "port-based" capability MUST be compatible
       with the "Don't Preempt" bit. When an interface recovers,
       a peering PE signaling the D bit enables non-revertive behaviour at
       the port level.</t>
      </section>
      <section anchor="ac_df">
        <name>AC-Influenced DF Election</name>
        <t>The AC-DF bit MUST be set to 0 when advertising the Port Mode Load-Balancing capability
        (P=1).
        When an AC (sub-interface) goes down, it does not influence the DF election. The peer's
        Ethernet A-D per EVI route is ignored in all Port Mode DF Election algorithms.</t>
        <t>An AC-DF bit received as set (A=1) from a remote PE MUST be ignored when performing
        Port-Mode DF Election.</t>
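        <t>These two rules can be sketched, non-normatively, in Python. The masks assume
        that Bitmap bits are numbered from the most significant bit, as in
        <xref target="Bitmap"/>; the function names are hypothetical.</t>
        <sourcecode type="python"><![CDATA[
A_BIT = 0x4000  # bit 1 of the Bitmap, counted from the most significant bit
P_BIT = 0x0400  # bit 5 of the Bitmap


def bitmap_to_advertise(bitmap: int) -> int:
    """Clear the AC-DF bit whenever Port Mode (P=1) is advertised."""
    if bitmap & P_BIT:
        return bitmap & ~A_BIT & 0xFFFF
    return bitmap


def effective_remote_bitmap(bitmap: int) -> int:
    """Ignore a received A=1 when performing Port-Mode DF Election."""
    return bitmap & ~A_BIT & 0xFFFF
]]></sourcecode>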
      </section>
    </section>
    <section>
      <name>Convergence considerations</name>
      <t>To improve convergence upon failure and recovery when port‑active
   load‑balancing mode is used, some advanced synchronization
   between peering PEs may be required. Port-active is challenging in the
   sense that the "standby" port is in down state. It takes some time to
   bring a "standby" port to the up state and settle the network. For IRB
   and L3 services, ARP / ND caches may be synchronized. Moreover,
   associated VRF tables may also be synchronized. For L2 services, MAC
   table synchronization may be considered.</t>
      <t>Finally, for members of a LAG running LACP, the
   ability to set the "standby" port in "out-of-sync" state, also known as "warm‑standby",
   can be leveraged.</t>
      <section>
        <name>Primary / Backup per Ethernet-Segment</name>
        <t>The EVPN Layer 2 Attributes Control Flags extended community SHOULD be advertised in Ethernet A-D per ES routes
        for fast convergence.</t>
        <t>Only the P and B bits are relevant to this document, and only in the context of
        Ethernet A-D per ES routes:</t>
        <ul>
            <li>When advertised, the EVPN Layer 2 Attributes Control Flags extended community SHALL have
            only P or B bits set and all other bits and fields MUST be zero.</li>
            <li>A remote PE receiving the optional EVPN Layer 2 Attributes Control Flags extended community in Ethernet A-D per ES routes
            SHALL consider only P and B bits.</li>
        </ul>
     <t>For EVPN Layer 2 Attributes Control Flags extended community sent and received in Ethernet A-D per EVI
     routes used in <xref target="RFC8214"/>, <xref target="RFC7432"/> and <xref
            target="I-D.ietf-bess-evpn-vpws-fxc"/>:</t>
        <ul>
            <li>P and B bits received are overridden by "parent" bits on Ethernet A-D per ES above.</li>
            <li>Other fields and bits of the extended community are used according to the procedures
            of those documents.</li>
        </ul>
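        <t>The override rule above can be sketched, non-normatively, in Python. The PBFlags
        type and the effective_flags function are hypothetical names; the sketch models
        only the P and B bits and assumes that the absence of the extended community on
        the Ethernet A-D per ES route is represented as None.</t>
        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PBFlags:
    primary: bool
    backup: bool


def effective_flags(per_es: Optional[PBFlags],
                    per_evi: PBFlags) -> PBFlags:
    """P and B bits a remote PE applies to an Ethernet A-D per EVI route."""
    # Bits on the Ethernet A-D per ES route, when present, override the
    # bits carried on the Ethernet A-D per EVI route.
    return per_es if per_es is not None else per_evi
]]></sourcecode>
        <t>When no per-ES flags are present, the per-EVI bits are used unchanged.</t>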
      </section>
      <section>
        <name>Backward Compatibility</name>
        <t>Implementations that comply with <xref target="RFC7432"/> or <xref target="RFC8214"/> only (i.e., implementations 
    that predate this document) will not advertise the EVPN Layer 2 Attributes Control Flags extended community
    in Ethernet A-D per ES routes.  That means that all remote PEs in the ES will
    not receive P and B bits per ES and will continue to receive and honour
    the P and B bits received in Ethernet A-D per EVI route(s).
    Similarly, an implementation that complies with <xref target="RFC7432"/> or <xref target="RFC8214"/> only and
    that receives an EVPN Layer 2 Attributes Control Flags extended community will ignore it and will continue
    to use the default path resolution algorithm.</t>
      </section>
    </section>
    <section>
      <name>Applicability</name>
      <t>A common deployment is to provide L2 or L3 service on the PEs
   providing multi‑homing. The services could be any L2 EVPN such as
   EVPN VPWS, EVPN  <xref target="RFC7432"/>, etc. L3 service could be in VPN context
   <xref target="RFC4364"/> or in global routing context. When a PE provides first hop
   routing, EVPN IRB could also be deployed on the PEs. The mechanism
   defined in this document is used between the PEs providing L2 and/or
   L3 services, when per interface single-active load‑balancing is desired.</t>
      <t>A possible alternative to the solution described in this document is
   MC-LAG with ICCP <xref target="RFC7275"/> active-standby redundancy. However, ICCP
   requires LDP to be enabled as a transport for ICCP messages. There are
   many scenarios where LDP is not required, e.g., deployments with VXLAN
   or SRv6. The solution defined in this document with EVPN does not
   mandate the use of LDP or ICCP and is independent of the
   underlay encapsulation.</t>
    </section>
    <section>
      <name>Overall Advantages</name>
      <t>The use of port-active multi‑homing brings the following benefits to
   EVPN networks:</t>
      <ol spacing="normal" type="a"><li>Open-standards-based per-interface single-active load‑balancing
     mechanism that eliminates the need to run ICCP and LDP (e.g., in networks running VXLAN or SRv6).</li>
        <li>Agnostic of underlay technology (MPLS, VXLAN, SRv6) and associated
     services (L2, L3, Bridging, E-LINE, etc).</li>
        <li>Provides a way to enable deterministic QoS over MC-LAG attachment
     circuits.</li>
        <li>Fully compliant with <xref target="RFC7432"/>, and does not require any new protocol
     enhancement to existing EVPN RFCs.</li>
        <li>Can leverage various DF election algorithms, e.g., modulo and HRW.</li>
        <li>
          <t>Replaces the legacy MC-LAG ICCP-based solution, and offers the following
     additional benefits:

          </t>
          <ul spacing="normal">
            <li>Efficiently supports 1+N redundancy mode (with EVPN using BGP
       RR), whereas ICCP requires a full mesh of LDP sessions among PEs in
       the redundancy group.</li>
            <li>Fast convergence with mass-withdraw is possible with EVPN, with no
       equivalent in ICCP.</li>
          </ul>
        </li>
      </ol>
    </section>
    <section anchor="IANA">
      <name>IANA Considerations</name>
      <t>This document solicits the allocation of the following values:</t>
      <ul spacing="normal">
        <li>Bit 5 in the  <xref target="RFC8584"/> DF Election Capabilities registry, 
     with name "P" for Port Mode Load-Balancing.</li>
      </ul>
    </section>
    <section>
      <name>Security Considerations</name>
      <t>The same Security Considerations described in <xref target="RFC7432"/> and  <xref target="RFC8584"/> are valid for this
      document.</t>
      

      <t>By introducing a new capability, a new requirement for unanimity (or lack thereof) between PEs is
      added. Without consensus on the new DF election procedures and Port Mode,
      the DF election algorithm falls back to the default DF
      election as provided in <xref target="RFC8584"/> and <xref target="RFC7432"/>.
      This behavior could be exploited by an attacker that manages to modify the configuration of one PE in
      the ES so that the DF election algorithm and capabilities in all the
      PEs in the ES fall back to the default DF election.  If that is the
      case, the PEs will be exposed to the same unfair load balancing, service
      disruption, and possibly black-holing or duplicate traffic mentioned in those documents and their security
      sections.
      </t>
    </section>
    <section>
      <name>Acknowledgements</name>
      <t>The authors thank Anoop Ghanwani for his comments and suggestions and Stephane Litkowski
      for his careful review.</t>
    </section>


</middle>
  <!--  *****BACK MATTER ***** -->
<back>
    <!-- References split into informative and normative -->

   <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.7432.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.8174.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.8214.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.8584.xml"/>
        <?rfc include='reference.I-D.draft-ietf-bess-evpn-pref-df-08.xml' ?>

</references>

<references>
        <name>Informative References</name>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.4364.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.7275.xml"/>
        <?rfc include='reference.I-D.draft-ietf-bess-evpn-vpws-fxc-05.xml' ?>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml6/reference.IEEE.802.1AX_2014.xml"/>
      
      </references>
    </references>
  </back>
</rfc>
