<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->

<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
        <!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
        <!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml">
        <!ENTITY RFC3277 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3277.xml">
        <!ENTITY RFC3719 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3719.xml">
        <!ENTITY RFC4271 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4271.xml">
        <!ENTITY RFC5120 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5120.xml">
        <!ENTITY RFC5301 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5301.xml">
        <!ENTITY RFC5303 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5303.xml">
        <!ENTITY RFC5304 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5304.xml">
        <!ENTITY RFC5305 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5305.xml">
        <!ENTITY RFC5308 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5308.xml">
        <!ENTITY RFC5309 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5309.xml">
        <!ENTITY RFC5311 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5311.xml">
        <!ENTITY RFC5316 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5316.xml">
        <!ENTITY RFC5440 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5440.xml">
        <!ENTITY RFC5449 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5449.xml">
        <!ENTITY RFC5614 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5614.xml">
        <!ENTITY RFC5837 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5837.xml">
        <!ENTITY RFC5820 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5820.xml">
        <!ENTITY RFC6232 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6232.xml">
        <!ENTITY RFC7182 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7182.xml">
        <!ENTITY RFC7356 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7356.xml">
        <!ENTITY RFC7921 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7921.xml">
        <!ENTITY RFC7981 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7981.xml">
        <!ENTITY RFC8174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml">
        ]>

<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="4"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>
<rfc category="exp" docName="draft-ietf-lsr-distoptflood-01" ipr="trust200902">

    <!-- ***** FRONT MATTER ***** -->

    <front>

        <title>IS-IS Optimal Distributed Flooding for Dense Topologies</title>

        <author initials='R.' surname='White' fullname='Russ White'>
            <organization>Akamai</organization>
            <address>
                <email>russ@riw.us</email>
            </address>
        </author>

        <author initials='S.' surname='Hegde' fullname='Shraddha Hegde'>
            <organization>Juniper Networks</organization>
            <address>
                <email>shraddha@juniper.net</email>
            </address>
        </author>

        <author initials='T.' surname='Przygienda' fullname='Tony Przygienda'>
            <organization>Juniper Networks</organization>
            <address>
                <email>prz@juniper.net</email>
            </address>
        </author>

        <date/>

        <abstract>
            <t>In dense topologies (such as data center fabrics based on the Clos and butterfly topologies, though not
                limited to
                those exclusively), IGP flooding mechanisms designed originally
                for sparse topologies can "overflood," or in other words
                generate too many identical copies of topology and reachability
                information arriving at a given node from other devices. This normally results in slower
                convergence times and higher resource utilization to process and discard the superfluous copies.
                The modifications to the flooding mechanism in the Intermediate System to Intermediate System (IS-IS)
                link state protocol
                described in this document reduce resource utilization significantly, while improving convergence
                performance
                in dense topologies. Besides reducing the extraneous copies, the mechanism uses the dense topology to
                "load-balance" flooding across different possible paths in the network to prevent the buildup of
                flooding hot-spots.
            </t>

            <t>Note that a Clos fabric is used as the primary example of a dense flooding topology throughout this
                document.
                However, the flooding optimizations described in this document apply to any arbitrary topology.
            </t>

        </abstract>

    </front>

    <middle>

        <!-- 1 -->
        <section title="Introduction" toc="default">

            <!-- 2 -->
            <section title="Goals" toc="default">

                <t>The goal of this draft is to solve one of the problems that occur when operating a link state
                    protocol in a densely meshed topology. Such topologies, with their high average fanout,
                    cause too many copies of identical information to be flooded within the network.
                    Analysis and experiments show,
                    for instance, that in a butterfly fabric of around 2,500 intermediate systems, each intermediate
                    system will receive over 40 copies of any changed LSP fragment. This not only wastes bandwidth and
                    processor time but also dramatically slows convergence under topological changes.
                </t>

                <t>This document describes a set of modifications to the existing IS-IS flooding mechanisms which
                    minimize
                    the number of copies of each LSP fragment
                    received by individual intermediate systems. In its most aggressive form, the change leads to
                    only one copy per intermediate
                    system being processed.
                    The mechanisms described in this document are similar to and based on
                    those implemented in OSPF to support mobile
                    ad-hoc networks,
                    as described in <xref target="RFC5449"/>, <xref target="RFC5614"/>, and <xref target="RFC7182"/>.
                    These solutions
                    have been widely implemented and deployed.
                </t>

            </section> <!-- end of goals -->

            <!-- 2 -->
            <section title="Contributors" toc="default">

                <t>The following people have contributed to this draft and are listed in no particular
                    order: Abhishek Kumar, Nikos Triantafillis, Ivan
                    Pepelnjak, Christian Franke, Hannes Gredler, Les Ginsberg, Naiming Shen, Uma Chunduri, Nick Russo,
                    and Rodny Molina.
                </t>

            </section> <!-- end of contributors -->

            <section title="Experimental Evidence" toc="default">

                <t>Laboratory tests based on a well-known open source codebase
                    show that modifications similar to the ones described in this draft
                    significantly reduce flooding in a large-scale emulated
                    butterfly network topology.
                    Under unmodified flooding procedures, intermediate systems receive, on average, 40 copies of any
                    changed LSP fragment in a 2,500-node butterfly network.
                    With the changes described in this document, the same systems received, on average, two
                    copies of any changed LSP fragment.
                    In many cases, only a single copy of each changed LSP was received and processed per node.
                    In terms of performance, overall convergence times were cut roughly in half.
                </t>

                <t>An early version of the mechanisms described in this document has been implemented in the FRRouting
                    open source routing stack as part of the `fabricd` daemon.
                </t>

            </section> <!--end of experience -->



            <section title="Example Network" toc="default">

                <t>The following spine-and-leaf fabric is used in the remainder of this document to describe
                    the modifications introduced.</t>

                <figure align="center" anchor="is-model">
                    <artwork align="left"><![CDATA[
+====+ +====+ +====+ +====+ +====+ +====+
| 1A | | 1B | | 1C | | 1D | | 1E | | 1F | (T0)
+====+ +====+ +====+ +====+ +====+ +====+

+====+ +====+ +====+ +====+ +====+ +====+
| 2A | | 2B | | 2C | | 2D | | 2E | | 2F | (T1)
+====+ +====+ +====+ +====+ +====+ +====+

+====+ +====+ +====+ +====+ +====+ +====+
| 3A | | 3B | | 3C | | 3D | | 3E | | 3F | (T2)
+====+ +====+ +====+ +====+ +====+ +====+

+====+ +====+ +====+ +====+ +====+ +====+
| 4A | | 4B | | 4C | | 4D | | 4E | | 4F | (T1)
+====+ +====+ +====+ +====+ +====+ +====+

+====+ +====+ +====+ +====+ +====+ +====+
| 5A | | 5B | | 5C | | 5D | | 5E | | 5F | (T0)
+====+ +====+ +====+ +====+ +====+ +====+
]]></artwork>
                </figure>

                <t>For readability, the figure above omits the connections between devices.
                    The reader should assume that each device in a given layer
                    is connected to every device in
                    the layer above it in a butterfly network fashion. For instance:
                </t>

                <t>
                    <list style="symbols">
                        <t>5A is connected to 4A, 4B, 4C, 4D, 4E, and 4F</t>
                        <t>5B is connected to 4A, 4B, 4C, 4D, 4E, and 4F</t>
                        <t>4A is connected to 3A, 3B, 3C, 3D, 3E, 3F, 5A, 5B, 5C, 5D, 5E, and 5F</t>
                        <t>4B is connected to 3A, 3B, 3C, 3D, 3E, 3F, 5A, 5B, 5C, 5D, 5E, and 5F</t>
                        <t>etc.</t>
                    </list>
                </t>

                <t>The tiers or stages of the fabric are marked for easier reference.
                    An alternate representation of this topology is a "folded Clos" with T2 being the
                    "top of the fabric" and T0 representing the leaves.
                </t>

            </section>

        </section> <!-- End of the introduction section -->

        <!-- 1 -->
        <section title="Flooding Modifications" toc="default">

            <t>This section describes detailed modifications to the IS-IS flooding process that reduce flooding load in a
                densely meshed topology. At the same time, they distribute the reduced flooding across the whole
                topology to prevent hot-spots.
            </t>

            <!-- 2 -->
            <section title="Optimizing Flooding" toc="default">

                <t>The simplest way to conceive of the solution presented here is in two stages:</t>

                <t>
                    <ul spacing="normal">
                        <li>Stage 1: Forward Optimization
                        <t>
                        <ul spacing="normal" >
                                <li>Find the group of intermediate systems that will all flood to the same set of
                                    neighbors as the local IS
                                </li>
                                <li>Decide (deterministically) which subset of the intermediate systems within
                                    this group
                                    should re-flood any received LSPs
                                </li>
                        </ul>
                        </t>
                        </li>
                        <li>Stage 2: Reverse Optimization
                        <t>
                        <ul spacing="normal" >
                                <li>Find neighbors on the shortest path towards the origin of the change</li>
                                <li>Do not flood towards these neighbors</li>
                        </ul>
                        </t>
                        </li>
                    </ul>
                </t>

                <t>The first stage is best explained through an illustration. In the network above, if 5A transmits a
                    modified Link State Protocol Data Unit (LSP) to 4A-4F, each of the nodes 4A-4F will, in turn, flood
                    this modified LSP to 3A (for instance). As a result, 3A will receive 6 copies of the modified LSP,
                    while only one copy
                    is necessary for the intermediate systems shown to converge on the same view of the topology. If
                    4A-4F could determine that all of them will flood identical copies of the modified LSP to 3A,
                    it would be possible
                    for all of them except one to decide not to flood the changed LSP to 3A.
                </t>

                <t>The technique used in this draft to determine such a flooding group is for each intermediate system
                    to calculate
                    a special SPT (shortest-path spanning tree) from the point of view of the transmitting neighbor.
                    By
                    setting the metric of all links to 1 and truncating the SPT
                    to two hops, the local IS can find the
                    group of neighbors towards which it will flood any changed LSP and the set of intermediate systems
                    (not necessarily neighbors) which will also flood to this same set of neighbors. If every
                    intermediate system in the flooding set performs this same calculation, they will all obtain the
                    same flooding
                    group.
                </t>
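
                <t>As a non-normative illustration, the truncated SPT calculation can be sketched in Python as a
                    breadth-first search limited to two hops, which is sufficient because all link metrics are set
                    to 1. The function names truncated_spt and example_fabric, the adjacency-map representation, and
                    the use of node names in place of system IDs are assumptions made for this sketch and are not
                    part of this specification.
                </t>

                <figure align="center">
                    <artwork align="left"><![CDATA[
```python
from collections import deque


def truncated_spt(adj, root, max_hops=2):
    """Return {node: hop_count} for every node within max_hops of root.

    With all metrics set to 1, a breadth-first search yields the SPT.
    """
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # truncate: do not expand beyond max_hops
        for nbr in sorted(adj[node]):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def example_fabric():
    """Adjacency map for the five-tier example network (assumed wiring:
    every node is connected to every node in each adjacent tier)."""
    adj = {}
    tiers = "12345"
    for upper, lower in zip(tiers, tiers[1:]):
        for a in "ABCDEF":
            for b in "ABCDEF":
                adj.setdefault(upper + a, set()).add(lower + b)
                adj.setdefault(lower + b, set()).add(upper + a)
    return adj


dist = truncated_spt(example_fabric(), "5A")
one_hop = sorted(n for n, d in dist.items() if d == 1)
two_hop = sorted(n for n, d in dist.items() if d == 2)
```
]]></artwork>
                </figure>

                <t>With 5A as the transmitting neighbor in the example network, one_hop holds 4A-4F and two_hop holds
                    3A-3F and 5B-5F; nodes in the remaining tiers fall outside the truncated tree.
                </t>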

                <t>Once such a flooding group is determined, the members of the flooding group will each (independently)
                    choose which of the members should re-flood the received information. A
                    common hash function is applied to a set
                    of shared
                    variables so each member of the group comes to the same conclusion as to the designated flooding
                    nodes.
                    The group member which is `selected` in this way floods the changed LSP normally;
                    the remaining group
                    members initially suppress the flooding of the LSP.
                </t>

                <t>Note that
                    no signaling between the intermediate systems running this flooding reduction mechanism
                    is required for the solution to work.
                    Each IS calculates
                    the special, truncated SPT separately, and determines which IS should flood any changed LSPs
                    independently based on a common
                    hash function.
                    Because these calculations are performed using a shared view of the network
                    (based on the common link state database) and a shared hash function, however, each member of
                    the flooding
                    group will make the same decision under converged conditions. In a transitory state, where nodes
                    may have different views of the topology, the mechanism may either overflood or, in the worse case,
                    not flood enough. A 'quick-patching' mechanism is introduced later for the latter case; in
                    any event, the databases will ultimately converge due to periodic CSNP origination per normal
                    protocol operation.
                </t>

                <t>The second stage is simpler, consisting of a single rule: do not flood modified LSPs along the
                    shortest path towards the origin of the modified LSP. This rule relies on the observation that any
                    IS between the origin of the modified LSP and the local IS should receive the modified LSP from some
                    other IS closer to the source of the modified LSP. It is worth observing that
                    if all the nodes that should be designated
                    to flood within a peer group are pruned by the second stage, the receiving node is at the `tail-end`
                    of the flooding chain and no further flooding is necessary. Also, per normal protocol
                    procedures, the LSP is not flooded back to the node from which it was received.
                </t>
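
                <t>The second-stage rule can be illustrated with the following non-normative Python sketch. It
                    assumes that "on the shortest path towards the origin" means a neighbor that is exactly one hop
                    closer to the originator than the local IS, and it uses hop counts rather than configured
                    metrics; both are simplifying assumptions of this sketch. The names hop_counts and
                    reverse_pruned_neighbors are hypothetical.
                </t>

                <figure align="center">
                    <artwork align="left"><![CDATA[
```python
from collections import deque


def hop_counts(adj, root):
    """Breadth-first hop counts from root over an adjacency map."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def reverse_pruned_neighbors(adj, local, originator):
    """Neighbors of local that lie one hop closer to the originator.

    Under the assumptions stated above, the local IS does not flood
    the modified LSP towards these neighbors.
    """
    dist = hop_counts(adj, originator)
    if local not in dist:
        return set()  # originator unreachable: nothing to prune
    return {
        nbr
        for nbr in adj[local]
        if nbr in dist and dist[nbr] + 1 == dist[local]
    }
```
]]></artwork>
                </figure>

                <t>In the example network, with 5A as the originator, this sketch prunes 4A-4F for the local IS 3A
                    and prunes only 5A for the local IS 4A.
                </t>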

            </section> <!-- end of optimized flooding -->

            <!-- 2 -->
            <section title="Optimization Process Details" toc="default">

                <t>
                    This section provides the normative description of the mechanism. Any node implementing
                    this solution MUST exhibit external behavior that conforms to the algorithms provided.
                </t>
                <t>Each intermediate system will determine whether it should re-flood LSPs as described below.
                    When a modified LSP arrives from a Transmitting Neighbor (TN), the following
                    algorithm yields the necessary decision:
                </t>

                <t>Step 1: Build the Two-Hop List (THL) and Remote Neighbor's List (RNL) by:</t>
                <t>
                    <ol spacing="normal" type="%C)">
                        <li> Set all link metrics to 1</li>
                        <li> Calculate an SPT truncated to 2 hops from the perspective of TN</li>
                        <li> For each IS that is two hops away (has a metric of two in the truncated SPT) from TN:
                            <ol spacing="normal" type="%i.">
                                <li>If the IS is a neighbor of the LSP originator, skip</li>
                                <li>If the IS is on the shortest path towards the originator of the modified LSP, skip
                                </li>
                                <li>If the IS is *not* on the shortest path towards the originator of the modified LSP, add
                                    it to THL
                                </li>
                            </ol>
                        </li>

                        <li>Add each IS that is one hop away from TN to the RNL</li>
                    </ol>
                </t>
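
                <t>The following Python sketch is a non-normative illustration of Step 1. It assumes that an IS
                    is "on the shortest path towards the originator" when it lies on a shortest hop-count path from
                    TN to the originator; that interpretation, the function names hop_counts and build_thl_rnl, and
                    the use of node names in place of system IDs are assumptions of this sketch. For convenience the
                    sketch returns RNL already sorted, which anticipates Step 2.
                </t>

                <figure align="center">
                    <artwork align="left"><![CDATA[
```python
from collections import deque


def hop_counts(adj, root, max_hops=None):
    """Breadth-first hop counts from root, optionally truncated."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if max_hops is not None and dist[node] == max_hops:
            continue
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def build_thl_rnl(adj, tn, originator):
    """Return (THL as a set, RNL as a sorted list) for a given TN."""
    from_tn = hop_counts(adj, tn, max_hops=2)  # sub-steps A and B
    from_orig = hop_counts(adj, originator)    # for the path test
    thl = set()
    for node, hops in from_tn.items():
        if hops != 2:
            continue
        if node in adj[originator]:            # C.i: originator's neighbor
            continue
        on_path = (                            # C.ii (assumed meaning)
            tn in from_orig
            and node in from_orig
            and hops + from_orig[node] == from_orig[tn]
        )
        if on_path:
            continue
        thl.add(node)                          # C.iii
    rnl = sorted(n for n, hops in from_tn.items() if hops == 1)  # D
    return thl, rnl
```
]]></artwork>
                </figure>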

                <t>Step 2: Sort nodes in RNL by system IDs, from the least value to the greatest.</t>

                <t>Step 3: Calculate a number, N, by first adding together each byte of the LSP-ID under consideration
                    (excluding the fragment ID) and then adding the value of its
                    fragment ID MOD 2 (footnote 1: this allows for some balancing of LSPs coming from the same system ID
                    without introducing an excessive amount of per-originator state in an implementation).
                    Subsequently, set N to N MOD the number of neighbors in RNL.
                    As a result, N will be less than the number of members of RNL.
                </t>
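
                <t>A minimal, non-normative Python sketch of Step 3 follows. It assumes the common 8-octet LSP-ID
                    layout (6-octet system ID, 1-octet pseudonode ID, 1-octet fragment ID) and reads "fragment ID
                    MOD 2" as adding the low-order bit of the fragment ID to the byte sum; the function name
                    start_index is hypothetical.
                </t>

                <figure align="center">
                    <artwork align="left"><![CDATA[
```python
def start_index(lsp_id: bytes, rnl_len: int) -> int:
    """Compute N for Step 3 from an 8-octet LSP-ID.

    Octets 0-6 (system ID and pseudonode ID) are summed; octet 7 is
    the fragment ID, of which only (fragment ID MOD 2) is added.
    """
    if len(lsp_id) != 8:
        raise ValueError("expected an 8-octet LSP-ID")
    if rnl_len <= 0:
        raise ValueError("RNL must not be empty")
    total = sum(lsp_id[:7]) + (lsp_id[7] % 2)
    return total % rnl_len


# System ID 0000.0000.0005, pseudonode 00, fragment 03, six RNL members
n = start_index(bytes.fromhex("0000000000050003"), 6)
```
]]></artwork>
                </figure>

                <t>In this example the byte sum is 5, the fragment contributes 1, and 6 MOD 6 gives N = 0.
                    The same originator with fragment 02 gives N = 5, which illustrates how consecutive fragments
                    can be steered towards different reflooders.
                </t>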

                <t>Step 4: Starting with the Nth member of RNL:</t>

                <t>
                    <ol spacing="normal" type="%C)">
                        <li>If THL is empty, exit</li>
                        <li>If this member of RNL is the local calculating IS, it MUST reflood the modified LSP;
                            exit
                        </li>
                        <li>Remove all members of THL connected to (adjacent to) this member of RNL</li>
                        <li>Move to the next member of RNL, wrapping to the beginning of RNL if necessary</li>
                    </ol>
                </t>
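
                <t>The selection loop of Step 4 can be sketched in Python as shown below. This is a non-normative
                    illustration that assumes N is a zero-based index into the sorted RNL (consistent with N being
                    less than the number of RNL members); the function name must_reflood and the adjacency-map
                    argument are hypothetical.
                </t>

                <figure align="center">
                    <artwork align="left"><![CDATA[
```python
def must_reflood(local, rnl, thl, adj, n):
    """Return True if the local IS is selected to reflood.

    rnl: RNL sorted by system ID; thl: THL; adj: map from each RNL
    member to the set of its adjacent systems; n: start index.
    """
    remaining = set(thl)
    for offset in range(len(rnl)):
        if not remaining:                  # A: THL is empty, exit
            return False
        member = rnl[(n + offset) % len(rnl)]
        if member == local:                # B: local IS must reflood
            return True
        remaining -= adj[member]           # C: covered by this member
        # D: the loop advances to the next member, wrapping around
    return False
```
]]></artwork>
                </figure>

                <t>In a hypothetical case where RNL is (R1, R2, R3), THL is {X, Y, Z}, R1 is adjacent to X and Y,
                    R2 to Y and Z, and R3 to Z, a start index of 0 selects R1 and R2 as reflooders, while R3
                    suppresses flooding because THL is already covered when its turn comes.
                </t>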

                <t>Note 1: This description favors clarity over optimal performance when
                    implemented.</t>
                <t>
                    Note 2: An implementation in a node MAY choose, independently of others,
                    to provide a configurable parameter that allows more than
                    one node in RNL to reflood, e.g., the node may reflood even when it would be
                    chosen from
                    the RNL only if double coverage of THL were required. The modifications to the algorithm
                    are simple enough to not require further text.
                </t>

            </section> <!-- end of optimization process -->

            <!-- 2 -->
            <section title="Flooding Failures" toc="default">

                <t>It is possible that during initial convergence or
                    in some failure modes flooding will be incomplete due to the
                    optimizations outlined.
                    Specifically, if a reflooder fails, or is somehow disconnected from all the links across which it
                    should be reflooding,
                    an LSP could be only partially distributed through the topology. Periodic CSNPs will
                    converge the databases under any circumstances, though at a slower pace. To speed up convergence
                    under such partition
                    failures,
                    an intermediate system which
                    does not reflood a specific LSP (or fragment) SHOULD:
                </t>

                <t>
                    <ol spacing="normal" type="%C)">
                        <li>Set a short, configurable timer which should be significantly shorter than the CSNP interval
                            in use.</li>
                        <li>When the timer expires, send a Partial Sequence Number Packet (PSNP) listing all LSPs that
                            have *not* been reflooded during the timer runtime to all neighbors, unless an up-to-date
                            PSNP or CSNP has already been received from the neighbor.
                        </li>
                        <li>Process any Partial Sequence Number Packets (PSNPs) received per normal protocol
                            procedures. PSNPs indicating that neighbors
                            still have older versions of the LSP will lead to the usual
                            synchronization of the databases that are out of sync due to optimized flooding.
                        </li>
                        <li>If the number of such resynchronizations (i.e., PSNPs
                            sent to the neighbors and answered with requests) exceeds a configurable threshold,
                            an implementation SHOULD notify the network operator of the condition via an
                            appropriate mechanism.
                        </li>
                    </ol>
                </t>
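
                <t>The bookkeeping behind this 'quick-patching' behavior could be sketched in Python as follows. The
                    sketch is non-normative: the class and method names, the integer millisecond clock passed in by
                    the caller, and the per-neighbor record of sequence numbers seen in received PSNPs or CSNPs are
                    all assumptions made for illustration, and actual PSNP encoding and transmission are out of its
                    scope.
                </t>

                <figure align="center">
                    <artwork align="left"><![CDATA[
```python
class SuppressedLspTracker:
    """Tracks LSPs whose reflooding was suppressed by the local IS."""

    def __init__(self, neighbors, patch_delay_ms):
        self.neighbors = sorted(neighbors)
        self.patch_delay_ms = patch_delay_ms  # well below CSNP interval
        self.pending = {}  # lsp_id -> (seqno, deadline_ms)
        self.seen = {}     # (neighbor, lsp_id) -> highest seqno in SNPs

    def suppress(self, lsp_id, seqno, now_ms):
        """Start the short timer for an LSP that was not reflooded."""
        self.pending[lsp_id] = (seqno, now_ms + self.patch_delay_ms)

    def reflooded(self, lsp_id):
        """The LSP was reflooded after all; no PSNP is needed."""
        self.pending.pop(lsp_id, None)

    def snp_received(self, neighbor, lsp_id, seqno):
        """Record an LSP entry seen in a PSNP or CSNP from a neighbor."""
        key = (neighbor, lsp_id)
        self.seen[key] = max(seqno, self.seen.get(key, 0))

    def expire(self, now_ms):
        """Return {neighbor: [(lsp_id, seqno), ...]} to send in PSNPs."""
        due = {
            lsp: seq
            for lsp, (seq, deadline) in self.pending.items()
            if deadline <= now_ms
        }
        for lsp in due:
            del self.pending[lsp]
        psnps = {}
        for nbr in self.neighbors:
            entries = [
                (lsp, seq)
                for lsp, seq in sorted(due.items())
                if self.seen.get((nbr, lsp), 0) < seq  # not up to date
            ]
            if entries:
                psnps[nbr] = entries
        return psnps
```
]]></artwork>
                </figure>

                <t>A caller would invoke suppress() whenever the algorithm decides not to reflood, and expire()
                    from its timer handling; neighbors that have already reported the current sequence number are
                    omitted from the result.
                </t>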

            </section> <!-- end of flooding failures -->

            <section title="Signaling" toc="default">
                <t>
                    A node deploying this algorithm SHOULD advertise algorithm value &lt;TBD&gt; in the IS-IS
                    Dynamic Flooding sub-TLV of the Router Capability TLV (242) <xref
                        target="RFC7981"/> as specified in <xref target="I-D.ietf-lsr-dynamic-flooding"></xref>. It
                    bears repeating that if the hashing algorithm a node uses differs from the one in this
                    draft, a different algorithm number must be assigned and used.
                </t>
            </section>

            <section title="Additional Deployment Considerations" toc="default">
                <t>
                    A node deploying this algorithm on point-to-point links MUST send CSNPs on such links. This does not
                    represent a dramatic change, given that most deployed implementations today already exhibit this
                    behavior to prevent possible slow synchronization of the IS-IS database across such links and to
                    provide additional periodic consistency guarantees.
                </t>
            </section>
            <!-- 2 -->
            <section title="Flooding Example" toc="default">

                <t>Assume, in the network specified, that 5A floods some modified LSP towards 4A-4F and
                    we only use a single node to reflood.
                    To determine whether 4A
                    should flood this LSP to 3A-3F:
                </t>
                <t>
                    <list style="symbols">
                        <t>5A is TN; 4A calculates a truncated SPT from 5A's perspective with all link metrics set to
                            1
                        </t>
                        <t>4A builds THL, which contains 3A, 3B, 3C, 3D, 3E, 3F, 5B, 5C, 5D, 5E, and 5F</t>
                        <t>4A builds RNL, which contains 4A, 4B, 4C, 4D, 4E, and 4F, sorting it by system ID</t>
                        <t>4A computes hash on the received LSP-ID to get N; assume N is 1 in this case</t>
                        <t>Since 4A is the 1st member of RNL and there are members in THL, 4A must reflood; the loop
                            exits
                        </t>
                    </list>
                </t>
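
                <t>The example can be replayed end to end with the following non-normative Python sketch, which runs
                    the decision once for every member of RNL to show that exactly one of them refloods. It assumes
                    that 5A is both TN and originator (so no two-hop node is skipped in Step 1), that every node is
                    connected to every node in each adjacent tier, and that N is a zero-based index into RNL, so the
                    first member of RNL corresponds to index 0. All function names are hypothetical.
                </t>

                <figure align="center">
                    <artwork align="left"><![CDATA[
```python
from collections import deque

TIERS = "12345"
COLS = "ABCDEF"


def fabric():
    """Adjacency map of the example network (assumed full bipartite
    connectivity between adjacent tiers)."""
    adj = {t + c: set() for t in TIERS for c in COLS}
    for upper, lower in zip(TIERS, TIERS[1:]):
        for a in COLS:
            for b in COLS:
                adj[upper + a].add(lower + b)
                adj[lower + b].add(upper + a)
    return adj


def hops(adj, root, limit):
    """Hop counts from root, truncated at limit hops."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if dist[node] < limit:
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
    return dist


def reflooders(adj, tn, n):
    """Members of RNL that decide to reflood when TN is the originator."""
    dist = hops(adj, tn, 2)
    rnl = sorted(x for x, d in dist.items() if d == 1)
    thl = {x for x, d in dist.items() if d == 2}
    chosen = []
    for local in rnl:              # each IS runs Step 4 independently
        remaining = set(thl)
        for k in range(len(rnl)):
            if not remaining:
                break
            member = rnl[(n + k) % len(rnl)]
            if member == local:
                chosen.append(local)
                break
            remaining -= adj[member]
    return chosen


selected = reflooders(fabric(), "5A", 0)
```
]]></artwork>
                </figure>

                <t>With a start index of 0, selected contains only 4A; with a start index of 3, only 4D would be
                    selected. In both cases the other five members find THL fully covered by the first candidate and
                    suppress flooding.
                </t>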

            </section> <!-- end of flooding example -->

            <!-- 2 -->
            <section title="A Note on Performance" toc="default">

                <t>The calculations described here seem complex, which might lead the reader to conclude that the cost of
                    calculation is so much higher than the cost of flooding that this optimization is
                    counter-productive.
                    First, the description provided here is designed for clarity rather than optimal calculation.
                    Second, many of the calculations involved can easily be performed in advance and stored,
                    rather than being performed for
                    each
                    LSP occurrence and each neighbor. Optimized versions of the process described here have
                    been implemented,
                    and do result in strong convergence speed gains.
                </t>

            </section> <!-- end of performance note -->

        </section> <!-- end of optimizing flooding -->

        <!-- 1 -->
        <section title="Security Considerations" toc="default">

            <t>This document outlines modifications to the IS-IS protocol for operation on high density network
                topologies. Implementations SHOULD implement IS-IS cryptographic authentication, as described in <xref
                        target="RFC5304"/>, and should enable other security measures in accordance with best common
                practices for the IS-IS protocol.
            </t>

        </section> <!-- end of security considerations -->

    </middle>

    <back>

        <references title="Normative References">

            &RFC2119;
            &RFC2629;
            &RFC5120;
            &RFC5301;
            &RFC5303;
            &RFC5305;
            &RFC5308;
            &RFC5309;
            &RFC5311;
            &RFC5316;
            &RFC7356;
            &RFC7981;
            &RFC8174;

            <?rfc include="reference.I-D.ietf-lsr-dynamic-flooding.xml"?>

            <reference anchor="ISO10589">
                <front>
                    <title>Intermediate system to Intermediate system intra-domain
                        routeing information exchange protocol for use in conjunction with
                        the protocol for providing the connectionless-mode Network Service
                        (ISO 8473)
                    </title>

                    <author>
                        <organization abbrev="ISO">International Organization for Standardization</organization>
                    </author>

                    <date month="Nov" year="2002"/>
                </front>

                <seriesInfo name="ISO/IEC" value="10589:2002, Second Edition"/>
            </reference>

        </references> <!-- end of normative references -->

        <references title="Informative References">

            &RFC3277;
            &RFC3719;
            &RFC4271;
            &RFC5304;
            &RFC5440;
            &RFC5449;
            &RFC5614;
            &RFC5820;
            &RFC5837;
            &RFC6232;
            &RFC7182;
            &RFC7921;

            <?rfc include="reference.I-D.ietf-isis-segment-routing-extensions.xml"?>

        </references> <!-- end of informative references -->

    </back>

</rfc>
