BGP Extension for 5G Edge Service Metadata

This document describes a new Metadata Path Attribute added to a BGP UPDATE message for egress routers to advertise the Metadata about 5G low latency edge services directly attached to the egress routers. 5G is characterized by having edge services closer to the Cell Towers reachable by Local Data Networks (LDN). From an IP network perspective, the 5G LDN is a limited domain with edge services a few hops away from the ingress nodes. Only selected UE services are considered as 5G low latency edge services.

Note: The proposed edge service Metadata Path Attribute is not intended for the best-effort services reachable via the public internet.

The information carried by the Metadata Path Attribute can be used by the ingress routers to make path selections for selected low latency services based on not only the network distance but also the running environment of the edge cloud sites. The goal is to improve latency and performance for 5G ultra-low latency services. The extension is targeted for a single domain with a Route Reflector (RR) controlling the propagation of the BGP UPDATE. The edge service Metadata Path Attribute is only attached to the low latency services (routes) hosted in the 5G edge cloud sites, which are only a small subset of services initiated from UEs; it is not attached to routes for UEs accessing general internet sites.

While the proposed Metadata Path Attribute is particularly beneficial for low latency services, it can be expanded to propagate information about GPU availability, power, or other resources necessary for compute-intensive services such as AI and machine learning. This flexibility makes it useful for a range of applications beyond low latency services.

The following conventions are used in this document.

- Edge DC: Edge Data Center, which provides the hosting environment for the edge services. An Edge DC might host 5G core functions in addition to the frequently used edge services.
- gNB: next generation Node B
- RTT: Round-trip Time
- PSA: PDU Session Anchor (UPF)
- UE: User Equipment
- UPF: User Plane Function

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC8174] when, and only when, they appear in all capitals, as shown here.

The goal of this edge service Metadata Path Attribute is for egress routers to propagate the metrics about the running environment for a subset of edge services to ingress routers so that the ingress routers can make path selections based on not only the routing cost but also the running environment for those edge services. BGP speakers that do not support the Metadata Path Attribute can ignore it in a BGP UPDATE message. All intermediate nodes can forward the entire BGP UPDATE as is.

Multiple metrics can be attached to one Metadata Path Attribute. One Metadata Path Attribute can contain computing service capability information, computing service states, computing resource states of the corresponding edge site, or more:

- Computing service capability information can be used to record information of the computing power node or initialization deployment information for computing service initialization.
- Computing service states can include the number of service connections, service duration, and so on.
- Computing resource states can be detailed information on computing resources such as CPU/GPU. They can also be an abstract metric derived from these detailed parameters to indicate the resource status of the edge site.

There could be more metrics about the running environment attached to the Metadata Path Attribute, e.g., some of the metrics being discussed by the CATS WG. This document illustrates a few examples of Sub-TLVs of the metrics under the edge service Metadata Path Attribute:

- the site physical availability index,
- the site preference index,
- the service delay prediction index, and
- the raw load measurement.

This section specifies how those Metadata impact the ingress node's path selections.

When an ingress router receives BGP updates for the same IP prefix from multiple egress routers, all these egress routers' loopback addresses are considered as the next hops for the IP prefix. For the selected low latency edge services, the ingress router BGP engine would call an edge service Management function that can select paths based on the edge service Metadata received. Section 5.1 has an exemplary algorithm to compute the weighted path cost based on the edge service Metadata carried by the Sub-TLV(s) specified in this document. Section 5 has the detailed description of the edge service Metadata influenced optimal path selection.

When the ingress router receives a packet and does a lookup on the route in the FIB, it gets the destination prefix's whole path. It encapsulates the packet destined towards the optimal egress node. For subsequent packets belonging to the same flow, the ingress router needs to forward them to the same egress router unless the selected egress router is no longer reachable. Keeping packets from one flow to the same egress router, a.k.a. Flow Affinity, is supported by many commercial routers. Most registered EC services have relatively short flows. How Flow Affinity is implemented is out of the scope for this document.

When a UE moves to a new 5G gNB which is anchored to the same UPF, the packets from the UE traverse to the same ingress router. Path selection and forwarding behavior are the same as before. If the UE maintains the same IP address when anchored to a new UPF, the directly connected ingress router might use the information passed from a neighboring router to derive the optimal Next Hop for this route. The detailed algorithm is out of the scope of this document.

The Metadata Path Attribute is an optional non-transitive BGP Path Attribute that carries metrics and metadata about the edge services attached to the egress router. The Metadata Path Attribute, to be assigned by IANA, consists of a set of Sub-TLVs, and each Sub-TLV contains information for specific metrics of the edge services.

Only a small subset of BGP UPDATE messages include the Metadata Path Attribute. The choice of which prefix carries the Metadata Path Attribute is determined by local policies. The Metadata Path Attribute can be included in a BGP UPDATE message together with other BGP Path Attributes, such as Communities, NEXT_HOP, Tunnel Encapsulation Path Attribute, etc. The Metadata Path Attribute has the following characteristics:

- Non-transitive. Boundary node filtering SHOULD be deployed to remove the BGP Metadata Path Attribute at the administrative boundary to prevent the distribution of the BGP Metadata Path Attribute beyond its intended scope of applicability.
- Can be packed with NLRI (AFI/SAFI) Unicast (1/1, 2/1), Label Unicast (AFI/SAFI - ), IPv6 Anycast.
- MUST contain at least one metadata Sub-TLV. Multiple Metadata Sub-TLVs can be included in a Metadata Path Attribute in one BGP UPDATE message.
- The choice of the Sub-TLVs present in the BGP Metadata Path Attribute is determined by the local policies.

The metrics Sub-TLVs included in the Metadata Path Attribute apply to all the address families carried in the NLRI field of the BGP UPDATE message. For a multi-protocol BGP UPDATE message, the metrics Sub-TLVs included in the Metadata Path Attribute apply to all the AFI/SAFI address families carried by the MP_REACH_NLRI.

A BGP speaker that advertises a path received from one of its neighbors SHOULD advertise the BGP Metadata Path attribute received with the path without modification as long as the BGP Metadata Path attribute was acceptable. If the path did not come with a BGP Metadata Path attribute, the speaker MAY attach a BGP Metadata Attribute to the path if configured to do so. A BGP Peer receiving a BGP Metadata Path attribute should ignore Sub-TLVs with unknown types and process the recognized Sub-TLVs. BGP Peers should not delete any Sub-TLV from the BGP Metadata Path Attribute.


The Metadata Path Attribute MUST contain at least one metadata Sub-TLV. Multiple Metadata Sub-TLVs can be included in a Metadata Path Attribute in one BGP UPDATE message. The content of the Sub-TLVs present in the BGP Metadata Path Attribute is determined by the configuration.

When a BGP Speaker does not recognize some of the Sub-TLVs within one Metadata Path Attribute in a BGP UPDATE message, the BGP Speaker should forward the received BGP UPDATE message without any change if the transitive bit is set to 1. The domain ingress nodes should process the recognized Sub-TLVs carried by the Metadata Path Attribute and ignore the unrecognized Sub-TLVs. By default, a BGP speaker does not report any unrecognized Sub-TLVs within a Metadata Path Attribute unless configured to send a notification to its management system. The ingress node should be configured with an algorithm to combine the recognized metrics carried by the Sub-TLVs within a Metadata Path Attribute of the received BGP UPDATE message.

The metrics Sub-TLVs included in the Metadata Path Attribute apply to all the address families carried in the NLRI field of the BGP UPDATE message. For a multi-protocol BGP UPDATE message, the metrics Sub-TLVs included in the Metadata Path Attribute apply to all the AFI/SAFI address families carried by the MP_REACH_NLRI.
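The receiver behavior described above (process recognized Sub-TLVs, ignore unrecognized ones, never delete any) can be sketched as a simple walk over the Sub-TLV bytes. This is a minimal illustration, not the normative encoding: the one-octet Type and one-octet Length layout, the KNOWN_TYPES set, and the function name walk_sub_tlvs are assumptions made for this sketch. The only special case taken from this document is Sub-Type 2, which has a fixed size of 8 octets and no Length field.

```python
# Assumed layout (hypothetical): 1-octet Type, 1-octet Length, then the value.
# Exception taken from this document: Sub-Type 2 is fixed at 8 octets in total
# (including the Type octet) and carries no Length field.
KNOWN_TYPES = {1, 2, 3, 4, 5, 6}
FIXED_SIZE = {2: 8}


def walk_sub_tlvs(value: bytes):
    """Split a Sub-TLV byte string into recognized bodies and unknown types."""
    recognized, unknown = [], []
    pos = 0
    while pos < len(value):
        sub_type = value[pos]
        if sub_type in FIXED_SIZE:
            end = pos + FIXED_SIZE[sub_type]
            body = value[pos + 1:end]
        else:
            if pos + 2 > len(value):
                raise ValueError("truncated Sub-TLV header")
            end = pos + 2 + value[pos + 1]
            body = value[pos + 2:end]
        if end > len(value):
            raise ValueError("truncated Sub-TLV body")
        if sub_type in KNOWN_TYPES:
            recognized.append((sub_type, body))
        else:
            # Unknown Sub-TLVs are skipped for processing only; the original
            # bytes are left untouched for re-advertisement.
            unknown.append(sub_type)
        pos = end
    return recognized, unknown
```

The walker returns the unknown types separately so that an implementation could, if configured, notify its management system about them.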

The fields of the Metadata Path Attribute are as follows:

- AttFlag: Attribute flags, defined as:
  - The high-order bit (bit 0): set to 1 to indicate that the attribute is optional.
  - The second high-order bit (bit 1): set to 0 to indicate that the Metadata Path Attribute is non-transitive. This means that a BGP speaker that does not recognize the attribute will not propagate it to other BGP peers. This non-transitive setting prevents the Metadata Path Attribute from being leaked to peers outside the domain, ensuring it remains contained within the set of BGP speakers that understand it.
  - The third high-order bit (bit 2): same as specified by .
  - The fourth high-order bit (bit 3): set to 0 to indicate there is one octet for the Length field.
  - The fifth to eighth high-order bits (bits 4-7) are reserved.
- Type: Metadata Path Attribute: TBD1 (assigned by IANA).
- Length: Specifies the length of the value field in octets, not including the first three octets of the AttFlag, Type, and Length fields. The Length value is the total length of the Value field plus one reserved octet.
- Reserved: For future expansion.

All values in the Sub-TLVs are unsigned 32-bit integers.
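As an illustration of the flag settings above, the following sketch builds the attribute header followed by the Sub-TLVs. It is not a definitive encoder: METADATA_ATTR_TYPE is a placeholder because the real code point (TBD1) has not been assigned, the function name is hypothetical, and the Partial bit is assumed to be 0.

```python
import struct

# Placeholder only: the real attribute type code (TBD1) is to be assigned by IANA.
METADATA_ATTR_TYPE = 0xFE

FLAG_OPTIONAL = 0x80  # bit 0, set to 1
# bit 1 (transitive, 0x40) stays 0: the attribute is non-transitive
# bit 2 (partial, 0x20) is assumed to be 0 in this sketch
# bit 3 (extended length, 0x10) stays 0: the Length field is one octet


def encode_metadata_attribute(sub_tlvs: bytes) -> bytes:
    """Prepend AttFlag, Type, Length and the reserved octet to the Sub-TLVs."""
    if not sub_tlvs:
        raise ValueError("at least one Sub-TLV is required")
    length = 1 + len(sub_tlvs)  # one reserved octet plus the Sub-TLVs
    if length > 255:
        raise ValueError("value does not fit a one-octet Length field")
    header = struct.pack("!BBBB", FLAG_OPTIONAL, METADATA_ATTR_TYPE, length, 0)
    return header + sub_tlvs
```

Rejecting an empty Sub-TLV list mirrors the requirement that the attribute carries at least one Sub-TLV.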

Different services might have different preference index values configured for the same site. For example, Service-A requires high computing power, Service-B requires high bandwidth among its microservices, and Service-C requires high volume storage capacity. For a DC with relatively low storage capacity but high bisectional bandwidth, its preference index value for Service-B is higher and lower for Service-C. Site Preference Index can also be used to achieve stickiness for some services. It is out of the scope of this document how the preference index is determined or configured. The Preference Index Sub-TLV has the following format:

- Type: Site-Preference-Index Sub-Type = 1 (specified in this document).
- Length: Specifies the total length in octets of the value field (not including the Type and Length fields). The Length is 5 for the Site-Preference-Index Sub-Type.
- Reserved: Reserved for future use.
- Preference Index value: 1 .. (2^32-1); the higher the value, the more preferred the site. A Preference Index value of 0 is reserved.
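A minimal sketch of encoding this Sub-TLV and choosing the most preferred site follows. The field widths are an assumption chosen to be consistent with Length = 5 (one reserved octet plus a 32-bit value), and the function names are hypothetical.

```python
import struct

SITE_PREFERENCE_SUBTYPE = 1


def encode_site_preference(pref: int) -> bytes:
    """Assumed layout: Type (1) | Length (1) = 5 | Reserved (1) | Index (4)."""
    if not 1 <= pref <= 2**32 - 1:
        raise ValueError("Preference Index must be in 1..2^32-1 (0 is reserved)")
    return struct.pack("!BBBI", SITE_PREFERENCE_SUBTYPE, 5, 0, pref)


def most_preferred(site_prefs: dict) -> str:
    """Return the site whose Preference Index is highest."""
    return max(site_prefs, key=site_prefs.get)
```

For example, with Site-A at 10 and Site-B at 200 for the same service, most_preferred selects Site-B.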

The Site Physical Availability Index indicates the percentage of impact on a group of routes associated with a common physical characteristic, for example, a pod, a row of server racks, a floor, or an entire DC. The purpose is to use one UPDATE message to indicate a group of routes of different NLRIs impacted by a physical event. For example, a power outage to a pod can cause the Site Physical Availability Index to be 0% for all the routes in the pod. A partial fiber cut to a row of shelves can cause the Site Physical Availability Index to be 50% for all the routes in those shelves. The value is 0-100, with 100% indicating the site is fully functional, 0% indicating the site is entirely out of service, and 50% indicating the site is 50% degraded. It is recommended to assign each route one Site-ID. Depending on deployment, one DC can use the POD number as the Site-ID, and another DC can use the Row of Shelves as the Site-ID.

Cloud Site/Pod failures and degradation include, but are not limited to, a site degradation or an entire site going down for a variety of reasons, such as a fiber cut connecting to the site or among pods, cooling failures, insufficient backup power, cyber threats or attacks, too many changes outside of the maintenance window, etc. Fiber cuts are not uncommon within a Cloud site or between sites. When those failure events happen, the edge (egress) router itself keeps running. Therefore, the ingress routers with paths to the egress router can't use BFD to detect the failures.

When there is a failure occurring at an edge site (or a pod), many instances can be impacted. In addition, the routes (i.e., the IP addresses) in the site might not be aggregated nicely. Instead of sending many BGP UPDATE messages to the ingress routers for all the impacted instances (i.e., routes), the egress router can send one single BGP UPDATE to indicate the capacity availability of the site.
The ingress routers can switch all or a portion of the instances associated with the site depending on how much the site is degraded. The BGP UPDATE for the individual instances (i.e., the routes) can include the Capacity Availability Index solely for ingress routers to associate the routes with the Site-ID. The actual Capacity Availability Index value, i.e., the percentage for all the routes associated with the Site-ID, is generated by the egress routers with the egress routers' loopback address as the NLRI. The Site Physical Availability Index Sub-TLV has a fixed length of 8 octets, including the Type field. Therefore, a Length field is not needed.

- Type: Indicates the Site-Physical-Availability-Index Sub-Type = 2 (specified in this document).
- RouteFlag-I (I): A flag bit. When set to 1, the Site Availability Index is for BGP speakers (receivers) to associate the routes with the Site-ID, and the Site Availability Percentage value is ignored. When set to 0, the BGP speakers (receivers) should apply the Site Availability Index value to all the routes associated with the Site-ID.
- Reserved: Reserved for future use. The bits are set to zero upon transmission and ignored upon reception.
- Site-ID: An identifier for a group of routes associated with a common physical characteristic, for example, a pod, a row of server racks, a floor, or an entire DC. The purpose is to use one UPDATE message to indicate a group of routes impacted by a physical event. Those routes might be from different address families or NLRIs. There could be multiple sites connected to one egress router (a.k.a. Edge DC GW).
- Site Availability Percentage: When the RouteFlag-I is 1, the Site Availability Percentage is ignored by the ingress routers. When the RouteFlag-I is set to 0, the Site Availability Percentage represents the percentage of the site availability for all the routes associated with the Site-ID, e.g., 100%, 50%, or 0%. When a site goes dark, the Index is set to 0. A value of 50 means 50% functioning. When the value is outside the 0-100 range, the value carried in this Sub-TLV is ignored.
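The RouteFlag-I and range rules above can be sketched as an encoder and decoder. The bit position of the I flag and the widths of the Site-ID and percentage fields are assumptions picked only to add up to the fixed 8 octets; the function names are hypothetical.

```python
import struct

SITE_AVAIL_SUBTYPE = 2
I_FLAG = 0x80  # assumed position: high-order bit of the flags octet

# Assumed layout (8 octets total, no Length field):
# Type (1) | I flag + reserved bits (1) | Site-ID (2) | Percentage (4)
_FMT = "!BBHI"


def encode_site_availability(site_id: int, percent: int, associate_only: bool) -> bytes:
    flags = I_FLAG if associate_only else 0
    return struct.pack(_FMT, SITE_AVAIL_SUBTYPE, flags, site_id, percent)


def decode_site_availability(data: bytes):
    """Return (site_id, associate_only, percent); percent is None when ignored."""
    if len(data) != 8:
        raise ValueError("Site Physical Availability Sub-TLV must be 8 octets")
    sub_type, flags, site_id, percent = struct.unpack(_FMT, data)
    if sub_type != SITE_AVAIL_SUBTYPE:
        raise ValueError("unexpected Sub-Type")
    associate_only = bool(flags & I_FLAG)
    # The percentage is ignored when RouteFlag-I is 1 or the value exceeds 100.
    if associate_only or percent > 100:
        percent = None
    return site_id, associate_only, percent
```

Returning None for an ignored percentage keeps the caller from applying a stale or out-of-range value to the routes of that Site-ID.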

An egress router sets itself as the next hop for a BGP peer before sending an UPDATE with the Metadata Path Attribute that includes the Site Physical Availability Index Sub-TLV. The Site Physical Availability Index Sub-TLV (with RouteFlag-I=1) is for ingress routers to associate the Site Identifier with the prefixes. However, it is not necessary to include the Site Physical Availability Index Sub-TLV for every BGP Update message if there is no change to the Site Identifier or the Site Physical Availability value for the prefixes.

Upon receiving a BGP update message from Router-X, containing the Metadata Path Attribute with the Site Physical Availability Index Sub-TLV, the next-hop should be the loopback of Router-X. The local BGP peer uses local policy to evaluate the route (prefix and path attributes). When the local policy processes the Metadata Attribute with the Site Physical availability, it will use the site availability index to efficiently reduce or increase the preference for all BGP routes with the Router-X next-hop (loopback). The BGP UPDATE with a standalone Site Availability Index is NOT intended for resolving NextHop.

It is desirable for an ingress router to select a site with the shortest processing time for an ultra-low latency service. But it is not easy to predict which site has "the fastest processing time" or "the shortest processing delay" for an incoming service request because:

- The given service instance shares the same physical infrastructure with many other applications and service instances. Service requests by other applications, UEs, or the running behavior of other applications can impact the processing time for the given service instance.
- The given service instance can be served by a cluster of servers behind a Load Balancer. To the network, the service is identified by one service ID.
- The service complexity is different. One service may call many microservices, need to access multiple backend databases, and need to go through sophisticated security scrubbing functions, etc. Another service can be processed by a few simple steps. Without the application internal logic, it is not easy to estimate the processing time for future service requests.

Even though utilization measurements, like those below, are collected by most data centers, they cannot indicate which site has the shortest processing time. A service request might be processed faster on Site-A even if Site-A is overutilized.

- Server utilization for the server where the instance is instantiated.
- The network utilization for the links to the server where the instance is instantiated.
- The number of databases that the service instance will access.
- The memory utilization of the databases.

The remaining available resource at a site is a more reasonable indication of processing delay for future service requests:

- The remaining available server resources.
- The remaining available network capacity on the links to the server where the instance is instantiated.
- The number of databases that the service instance will access.
- The remaining storage available for the databases.

The Service Delay Prediction Index is a value that predicts processing delays at the site for future service requests. The higher the value, the longer the delay.

While out of scope, we assume there is an algorithm that can derive the Service Delay Prediction Index that can be assigned to the egress router. When the Service Delay Prediction value is updated, which can be triggered by a change in the available resources, etc., the egress router can attach the updated Service Delay Prediction value in a Sub-TLV under the Metadata Path Attribute of the BGP UPDATE message sent to the ingress routers.

- Type: (Service Delay Prediction) Sub-Type = 3 (specified in this document).
- Length: Specifies the total length in octets of the value field, not including the Sub-Type and Length fields. The value of Length can be 5 or 9, depending on the format that the Service Delay Prediction Value uses.
- F flag: A single-bit flag to indicate the specific condition of the Service Delay Prediction Value.
- L flag: A single-bit flag to indicate the use of the 64-bit NTP Timestamp Format in the Service Delay Prediction Value field. It is valid only when the F-flag is set to 0.
- Reserved: Reserved for future use.
- Service Delay Prediction Index: an integer in the range of 0-100, with 0 indicating that the service delay is negligible and 100 indicating that the site has the most significant delay compared to all other sites for the same service. When the value is outside the 0-100 range, the value carried in this Sub-TLV is ignored.
- Service Delay Prediction Value: the estimated delay time encoded in the NTP Format as defined in . When the L-flag is 1, it is the 64-bit format; otherwise, it is the 32-bit short format.
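To make the two value formats concrete, the sketch below converts an estimated delay in seconds to and from the NTP formats: the 32-bit short format has a 16-bit seconds field and a 16-bit fraction field, and the 64-bit timestamp format has 32 bits of each. Treating the field as an elapsed duration rather than a wall-clock time is an assumption of this sketch, and the function names are hypothetical.

```python
import struct


def encode_delay(seconds: float, long_format: bool) -> bytes:
    """Encode a delay as NTP short (4 octets) or NTP timestamp (8 octets)."""
    if seconds < 0:
        raise ValueError("delay must be non-negative")
    whole_bits, frac_bits, fmt = (32, 32, "!II") if long_format else (16, 16, "!HH")
    whole = int(seconds)
    if whole >= (1 << whole_bits):
        raise ValueError("delay does not fit the seconds field")
    frac = int((seconds - whole) * (1 << frac_bits))  # truncated binary fraction
    return struct.pack(fmt, whole, frac)


def decode_delay(data: bytes) -> float:
    """Decode either format, selected by the number of octets."""
    if len(data) == 8:
        fmt, frac_bits = "!II", 32
    elif len(data) == 4:
        fmt, frac_bits = "!HH", 16
    else:
        raise ValueError("expected 4 or 8 octets")
    whole, frac = struct.unpack(fmt, data)
    return whole + frac / (1 << frac_bits)
```

The 4-octet and 8-octet results are consistent with the two Length values (5 and 9) if one additional octet carries the flags and reserved bits, which is again an assumption.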

When the detailed running status of data centers is not exposed to the network operator, historic traffic patterns through the egress nodes can be utilized to predict the load to a specific service. For example, when traffic volume to one service at one data center suddenly increases by a large percentage compared with the average of the past 24 hours, it is likely caused by a larger than normal demand for the service. When this happens, another data center with lower-than-average traffic volume for the same service might have a shorter processing time for the same service.

Here are some measurements that can be utilized to derive the Service Delay Prediction for a service ID:

- Total number of packets to the attached service instance (ToPackets);
- Total number of packets from the attached service instance (FromPackets);
- Total number of bytes to the attached service instance (ToBytes);
- Total number of bytes from the attached service instance (FromBytes).

The actual load measurement to the service instance attached to an egress router can be based on one of the metrics above or on all four metrics with different weights applied to each, such as:

   LoadIndex = w1*ToPackets + w2*FromPackets + w3*ToBytes + w4*FromBytes

where w1, w2, w3, and w4 are each between 0 and 1, and w1 + w2 + w3 + w4 = 1.

The weights of each metric contributing to the index of the service instance attached to an egress router can be configured or learned by self-adjusting based on user feedback. The Service Delay Prediction Index can be derived from LoadIndex/24Hour-Average. A higher value means a longer delay prediction. The egress router can use the ServiceDelayPred Sub-TLV to indicate to the ingress routers the delay prediction derived from the traffic pattern.

Note: The proposed IP layer load measurement is only an estimate based on the amount of traffic through the egress router, which might not truly reflect the load of the servers attached to the egress routers.
They are listed here only for some special deployments where those metrics are helpful to the ingress routers in selecting the optimal paths.
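The weighted LoadIndex and its ratio to the 24-hour average can be computed as in the following sketch. The equal default weights and the function names are illustrative assumptions; how the ratio is scaled into the 0-100 index range is not specified here and is left out.

```python
def load_index(to_packets, from_packets, to_bytes, from_bytes,
               weights=(0.25, 0.25, 0.25, 0.25)):
    """LoadIndex = w1*ToPackets + w2*FromPackets + w3*ToBytes + w4*FromBytes."""
    if any(not 0 <= w <= 1 for w in weights):
        raise ValueError("each weight must be between 0 and 1")
    if abs(sum(weights) - 1) > 1e-9:
        raise ValueError("weights must sum to 1")
    w1, w2, w3, w4 = weights
    return w1 * to_packets + w2 * from_packets + w3 * to_bytes + w4 * from_bytes


def delay_prediction_ratio(current_load: float, average_24h: float) -> float:
    """LoadIndex / 24Hour-Average; a higher ratio predicts a longer delay."""
    if average_24h <= 0:
        raise ValueError("24-hour average must be positive")
    return current_load / average_24h
```

A ratio of 1.5, for instance, means the current load is 50% above the 24-hour average for that service instance.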

When ingress routers have an embedded analytics tool that relies on the raw measurements, it is useful for the egress router to send the raw measurements. The Raw Measurement Sub-TLV has the following format:

- Raw-Measurement Sub-Type = 4 (specified in this document): Raw measurements metadata from the edge service address.
- Length: specifies the total length in octets of the value field, i.e., not including the Sub-Type and Length fields.
- Reserved: Reserved for future use.
- Value: The value field can contain multiple types of sub-TLVs, which are used to describe the raw metadata. A typical raw measurement metadata sub-TLV is defined below.

- Sub-Type = 4 (specified in this document): Type = 0, raw measurements of packets/bytes to/from the edge service address.
- Length: specifies the total length in octets of the value field, i.e., not including the Sub-Type and Length fields. The value is 22.
- Reserved: Reserved for future use.
- Measurement Period: BGP Update period in seconds or a user-specified period.
- Total number of packets and bytes.

The receiver nodes can compute the needed metrics, such as the Service Delay Prediction, for the service based on the raw measurements sent from the egress node and preconfigured algorithms.
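The following sketch shows one layout that is consistent with the stated Length of 22 and how a receiver could turn the raw counters into per-second rates. The field order, the two reserved octets, the type value 0 for the inner sub-TLV, and the function names are all assumptions of this sketch, not the normative format.

```python
import struct

RAW_PKT_BYTE_TYPE = 0  # assumed type value for the packets/bytes measurement

# Assumed layout: Type (1) | Length (1) = 22 | Reserved (2) | Period (4) |
# ToPackets (4) | FromPackets (4) | ToBytes (4) | FromBytes (4)
_FMT = "!BBHIIIII"


def encode_raw_measurement(period_s, to_pkts, from_pkts, to_bytes, from_bytes):
    return struct.pack(_FMT, RAW_PKT_BYTE_TYPE, 22, 0, period_s,
                       to_pkts, from_pkts, to_bytes, from_bytes)


def rates(data: bytes) -> dict:
    """Convert the raw counters into per-second rates over the period."""
    _type, length, _rsv, period, tp, fp, tb, fb = struct.unpack(_FMT, data)
    if length != 22:
        raise ValueError("unexpected Length")
    if period == 0:
        raise ValueError("Measurement Period must be non-zero")
    return {
        "to_packets_per_s": tp / period,
        "from_packets_per_s": fp / period,
        "to_bytes_per_s": tb / period,
        "from_bytes_per_s": fb / period,
    }
```

Normalizing by the Measurement Period lets a receiver compare egress routers that report over different periods.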

The service-oriented capability Sub-TLV is for distributing information regarding the capabilities of a specific service in a deployment environment. Depending on the deployment, a deployment environment can be an edge site or other types of environments. This information provides ingress routers or controllers with the available resources for the specific service in each deployment environment. It enables them to make well-informed decisions for the optimal paths to the selected deployment environment. Currently, the Sub-TLV only has an abstract value derived from various metrics, although the specifics of this derivation are beyond the scope of this document. Importantly, this value is significant only when comparing multiple data center sites for the same service; it is not meaningful when comparing different services, meaning the capability value relevant to Service A cannot be directly compared with that for Service B. Future enhancements may expand this sub-TLV to include more types of metrics or even raw data that represents direct metrics. This information is important in 5G network environments where efficient resource utilization is crucial for enhancing performance and service quality.

- Type: Indicates the Service-Oriented Capability Sub-Type = 5 (specified in this document).
- Length: Specifies the total length in octets, not including the Sub-Type and Length fields. The Length is 5 for the Service-Oriented Capability Sub-Type.
- Reserved: Reserved for future use.
- Metric Type: This document defines a default metric type, value 0, indicating a normalized metric generated from multiple types of metrics. The rules for generating the normalized metric are out of scope of this document and are defined per service. Other Metric Types could be defined by other documents in the future.
- Service-Oriented Capability Abstract Value: an integer between 0 and 2^32-1. A bigger number means larger capability, and a value of 0 indicates the site has the lowest relative capability for the service. The method used to derive this value is beyond the scope of this document.

Multiple Service-Oriented Capability Sub-TLVs with different metric types can be encoded in a Metadata Path Attribute, indicating that multiple metrics are carried. However, if more than one Service-Oriented Capability Sub-TLV with the same metric type is encoded in a Metadata Path Attribute, only the first one is processed and the others are ignored.
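The first-occurrence rule for Sub-TLVs that share a metric type can be sketched as follows. The input representation, a list of already decoded (metric_type, value) pairs in wire order, and the function name are assumptions for illustration.

```python
def select_by_metric_type(sub_tlvs):
    """Keep only the first value seen for each metric type, in wire order."""
    selected = {}
    for metric_type, value in sub_tlvs:
        # setdefault leaves an existing entry untouched, so later duplicates
        # of the same metric type are ignored.
        selected.setdefault(metric_type, value)
    return selected
```

For example, the sequence (0, 40), (1, 7), (0, 99) yields value 40 for metric type 0 and value 7 for metric type 1; the trailing 99 is ignored.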

The "Service-Oriented Available Resource Sub-TLV" is for distributing a metric that measures the real-time available resources allocated for processing specific services or applications at an edge site. This Sub-TLV complements the "Service-Oriented Capability Sub-TLV" described in Section 4.6, which addresses the static resource capability of a site for a service. While the Capability Abstract Value provides a baseline understanding of a site's potential to handle a service, the Available Resource metric offers a dynamic perspective by quantifying how much of this capacity is currently available. This distinction is crucial for managing resource efficiency and responsiveness in network operations, ensuring that capabilities are not only available but also optimally used to meet the actual service demands.

- Type: (Service-Oriented Available Resource Sub-Type) Sub-Type = 6 (specified in this document).
- Length: Specifies the total length in octets, excluding the Sub-Type and Length fields. The Length is 5 for the Service-Oriented Available Resource Sub-Type.
- P flag: A single-bit Percentage flag. When it is set to 1, the value is the Service-Oriented Available Resource expressed as a percentage. When the P flag is set to 0, the value in this Sub-TLV is the abstract value of the available resource.
- Reserved: Reserved for future use.
- Metric Type: This document defines a default metric type, value 0, indicating a normalized metric generated from multiple types of metrics. The rules for generating the normalized metric are out of scope of this document and are defined by the service. Other Metric Types could be defined by other documents in the future.
- Service-Oriented Available Resource Value (SO-AvailRes): When the P flag is set to 1, the value is a percentage (0-100), with 0 indicating that 0% of the capability is available and 100 indicating that 100% of the capability is available. When the value is outside the 0-100 range, the value carried in this Sub-TLV is ignored. For example, when the capability value is 50 and SO-AvailRes is 50 with the P flag set, 50% of the 50 units of resource is available, i.e., 25 units of resource are available at this site for the service. When the P flag is 0, the value of this field is the abstract value of the available resource. For example, when the capability value is 50 and SO-AvailRes is 50, all the resource is available.

Multiple Service-Oriented Available Resource Sub-TLVs with different metric types can be encoded in a Metadata Path Attribute, indicating that multiple metrics are carried. However, if more than one Service-Oriented Available Resource Sub-TLV with the same metric type is encoded in a Metadata Path Attribute, only the first one is processed and the others are ignored.
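The two interpretations of the available resource value can be worked through in a few lines. The function name and the choice of returning None for an ignored value are assumptions of this sketch.

```python
def available_units(capability: int, avail_res: int, p_flag: bool):
    """Resolve the available resource, in the same abstract units as capability.

    Returns None when a percentage value falls outside 0-100 and is ignored.
    """
    if p_flag:
        if not 0 <= avail_res <= 100:
            return None
        return capability * avail_res / 100
    # With the P flag cleared, the value already is the abstract amount.
    return float(avail_res)
```

With a capability of 50 and a value of 50, the P flag set gives 25 available units, while the P flag cleared gives 50 units, matching the two examples in the field description.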

Multiple instances of the same service could be attached to one egress router. When all instances of the same service are grouped behind one application layer load balancer, they appear as one single route to the egress router, i.e., the application load balancer's prefix. Under this scenario, the compute metrics for all those instances behind one application layer load balancer are aggregated under the application load balancer's prefix. In this case, the compute metrics aggregated by the Load Balancer are visible to the egress router as associated with the Load Balancer's prefix. However, how the application layer Load Balancers distribute the traffic among different instances is out of the scope of this document.

When multiple instances of the same service have different paths or links reachable from the egress router, multiple groups of metrics from the respective paths could be exposed to the egress router. The egress router can have preconfigured policies on aggregating various metrics from different paths and the corresponding policies for selecting a path for forwarding the packets received from ingress routers. The aggregated metrics can be carried in the BGP UPDATE messages instead of detailed measurements to reduce the entries advertised by the control plane and dampen route updates in the forwarding plane. Upon receiving packets from ingress routers, the egress router can use its policies to choose an optimal path to one service instance. It is out of the scope of this document how the measurements are aggregated on egress routers and how ingress routers are configured with the algorithms to integrate the aggregated metrics with network layer metrics.

Many measurements could impact and correspondingly reflect service performance. To simplify the optimal selection process, egress routers can have preconfigured policies or algorithms to aggregate multiple metrics into one simple metric advertised to ingress routers.
Though out of the scope of this document, an egress router can also have an algorithm to convert multiple metrics to network metrics, an IGP cost for each instance, to pass to ingress nodes. This decision-making process integrates network metrics computed by traditional IGP/BGP and the service delay metrics from egress routers to achieve a well-informed and adaptive routing approach. This intelligent orchestration at the edge enhances the service's overall performance and optimizes resource utilization across the distributed infrastructure. When the egress has merged the compute metrics from the local sites behind it, it can include one or more aggregated compute metrics in the Metadata Path Attribute in the BGP UPDATE to the Ingress. Also, an identifier or flag can be carried to indicate that the metrics are merged ones. After receiving the routes for the Service ID with the identifier, the ingress would do the route selection based on pre-configured algorithms (see Section 3 of this document).

Because the service metrics and network delays are in different units, an ingress router needs an algorithm to compare the cost of reaching the service instances at Site-i and Site-j. An exemplary algorithm uses the following inputs: the Capacity Availability Index at Site-i, where a higher value means more capacity is available; the network latency measurement (RTT) to the egress router at Site-i; the Preference Index for Site-i, where a higher value means higher preference; the Service Delay Prediction Index at Site-i for the service, i.e., for the ANYCAST address [RFC4786] of the service; and a Weight, which is a value between 0 and 1. If the Weight is smaller than 0.5, the network latency and the site Preference have more influence; otherwise, the Service Delay and the capacity availability have more influence. When a set of service Metadata is converted to a simple metric, the decision process is determined by the metric semantics and the deployment situation. The goal is to integrate the conventional network decision process with the service Metadata into a unified decision-making process for path selection.
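The comparison described above can be sketched as follows. This is an illustration only: the exact formula is an assumption, chosen so that each term is a unitless ratio of Site-i to Site-j and so that the Weight behaves as the text describes. The names SiteMetrics and relative_cost are hypothetical and are not defined by this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteMetrics:
    """Per-site inputs named after the indices in the text (field names are hypothetical)."""
    capacity: float       # Capacity Availability Index, higher is better
    net_delay: float      # network latency (RTT) to the site's egress router
    preference: float     # Preference Index, higher is better
    service_delay: float  # Service Delay Prediction Index, lower is better


def relative_cost(site_i: SiteMetrics, site_j: SiteMetrics, w: float) -> float:
    """Cost of Site-i relative to Site-j; a result below 1.0 favors Site-i.

    Assumed form: both terms are unitless ratios, so delays and indices
    expressed in different units can be combined with the Weight w.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be between 0 and 1")
    for site in (site_i, site_j):
        if min(site.capacity, site.net_delay,
               site.preference, site.service_delay) <= 0:
            # A value of 0 marks a site as ineligible; filter it out beforehand.
            raise ValueError("all indices must be positive")
    # Service side: lower delay and higher capacity reduce the cost.
    service_term = (site_i.service_delay * site_j.capacity) / (
        site_j.service_delay * site_i.capacity)
    # Network side: lower RTT and higher preference reduce the cost.
    network_term = (site_i.net_delay * site_j.preference) / (
        site_j.net_delay * site_i.preference)
    return w * service_term + (1.0 - w) * network_term
```

With this form, two sites with identical metrics always yield 1.0. A site with twice the capacity but a 25% longer RTT is favored when w = 0.8 and disfavored when w = 0.2, which matches the stated role of the Weight.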

When an ingress router receives BGP UPDATE messages for the same IP address from multiple egress routers, all those egress routers are considered next hops for the IP address. For the selected services configured to be influenced by the edge service Metadata, the ingress router's BGP Decision Process [IDR-CUSTOM-DECISION] triggers the edge service Management function to compute the weight to be applied to the route's next hop in the forwarding plane. The decision process is influenced by the edge service Metadata associated with the client routes, such as the Capacity Availability Index, Site Preference, and Service Delay Prediction Index, in addition to the attributes used by the traditional BGP multipath computation, such as Weight, Local Preference, Origin, and MED, as shown below:

      +--------+              +---------------+
      |        |              | EdgeServiceMgn|
      |Decision|< - - - - - - |               |
      +---^-|--+              +-------|-------+
          | |  BGP ANYCAST            | Update Anycast
          | |  Route                  | Route Nexthops
          | |  Multi-path NH install  | with weight
      +---|-V--+                      |
      |  RIB   |                      |
      +----+---+                      |
           |                          |
       +---V--------------------------V-------+
       |           Forwarding Plane           |
       |                                      |
       +--------------------------------------+

When any of those metadata values goes to 0, the effect is the same as the routes via the egress router that originated the metadata UPDATE becoming ineligible. When the metadata values merely degrade, the egress router can still remain the optimal next hop, although with a smaller likelihood.

Suppose the destination address aa08::4450 can be reached via three next hops (R1, R2, R3). Further, suppose the local BGP Decision Process, based on the traditional network-layer policies and metrics, identifies R1 as the optimal next hop for this destination. If the edge service Metadata results in R2 being the optimal next hop for the prefix, the Forwarding Plane will have R2 as the next hop for the destination address aa08::4450.

The edge service Metadata influencing next-hop selection is different from the metric (or weight) to the next hop. The metric to a next hop can affect many (sometimes tens of thousands of) routes that have the node as their next hop, whereas the edge service Metadata only affects the optimal next-hop selection for the subset of client routes that are identified as edge services.

When the BGP custom decision [IDR-CUSTOM-DECISION] is used, the edge service Management function would have an algorithm that combines the edge service Metadata attributes with the custom decision to derive the optimal next hop for the edge service routes.

Note: For a BGP UPDATE message that includes the edge service Metadata Path Attribute with the RouteFlag-I=0 and the egress router's loopback prefix as the NLRI, the Site Capacity Availability Index value is applied to all the routes associated with the Site-ID.
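The R1/R2/R3 example can be sketched as below. The scoring rule (capacity times preference divided by service delay) is a placeholder assumption, since this document leaves the actual algorithm to the edge service Management function. The function names and the metadata keys are hypothetical.

```python
from typing import Dict, Optional

Metadata = Dict[str, float]


def eligible(metadata: Metadata) -> bool:
    """A next hop is treated as ineligible when any metadata value is 0."""
    return all(value > 0 for value in metadata.values())


def select_next_hop(bgp_best: str,
                    metadata_by_next_hop: Dict[str, Metadata],
                    is_edge_service: bool) -> Optional[str]:
    """Return the next hop to install for one destination.

    Routes that are not identified as edge services keep the next hop
    chosen by the traditional BGP Decision Process.
    """
    if not is_edge_service:
        return bgp_best
    candidates = {nh: md for nh, md in metadata_by_next_hop.items()
                  if eligible(md)}
    if not candidates:
        return None  # no usable next hop for this edge service

    def score(next_hop: str) -> float:
        md = candidates[next_hop]
        # Placeholder scoring, assumed for illustration only.
        return md["capacity"] * md["preference"] / md["service_delay"]

    # On equal scores, prefer the next hop that BGP already selected.
    return max(candidates, key=lambda nh: (score(nh), nh == bgp_best))
```

For instance, if R1 reports low capacity, R2 reports high capacity with a lower service delay, and R3 reports a capacity of 0, the sketch installs R2 for the edge service destination even though the traditional decision chose R1, and it never considers R3.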

Service Metadata are only distributed to the ingress nodes that are interested in the Service. The group of interested ingress nodes can be configured or formed automatically. For each registered low-latency Service, BGP RT Constrained Distribution can be used to form the group interested in the Service, with the "Service ID", an IP address prefix, serving as the Route Target. When an ingress router receives the first packet of a flow destined to a Service ID (i.e., an IP prefix), the ingress router sends a BGP UPDATE that advertises the Route Target membership NLRI. The ingress router must assign a Timer to the Service ID, because the UE that uses the Service ID might move away. Upon receiving a packet destined for the Service ID, the ingress router must refresh the Timer. Upon expiration of the Timer, the ingress router must send a BGP Withdraw UPDATE for the Service ID. SAFI=132 is specified for the Route Target membership NLRI advertisements.
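The Timer behavior described above can be sketched as follows. The class name, the hold_time parameter, and the "sent" list that stands in for the BGP session are all hypothetical; this document does not specify a Timer value.

```python
import time
from typing import Callable, Dict, List, Tuple


class ServiceInterestTable:
    """Tracks which Service IDs this ingress router is currently interested in."""

    def __init__(self, hold_time: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.hold_time = hold_time          # hypothetical Timer duration, in seconds
        self.clock = clock
        self.expiry: Dict[str, float] = {}  # Service ID -> time at which the Timer expires
        # Stand-in for the BGP session: records (action, service_id) pairs.
        self.sent: List[Tuple[str, str]] = []

    def on_packet(self, service_id: str) -> None:
        """Called for every packet destined to a Service ID."""
        if service_id not in self.expiry:
            # First packet of a flow: advertise Route Target membership.
            self.sent.append(("advertise", service_id))
        # Every packet (including the first) starts or refreshes the Timer.
        self.expiry[service_id] = self.clock() + self.hold_time

    def expire(self) -> None:
        """Withdraw membership for every Service ID whose Timer has expired."""
        now = self.clock()
        for service_id in [s for s, t in self.expiry.items() if t <= now]:
            del self.expiry[service_id]
            self.sent.append(("withdraw", service_id))
```

The clock is injectable so that the refresh and expiry logic can be exercised without waiting. A real implementation would call expire() from a periodic task or a timer wheel.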

Because metric changes can affect path selection, a Minimum Interval for Metrics Change Advertisement is configured to control the update frequency and avoid route oscillations. The default is 30 seconds. Significant load changes at EC data centers can be triggered by short-term gatherings of UEs, such as conventions, that last a few hours or days, which is too short to justify adjusting EC server capacities among DCs. Therefore, the load metrics can change on the order of hours or days.
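One way an egress router could enforce the Minimum Interval is sketched below. Only the 30-second default comes from the text. The class and method names are hypothetical, and coalescing changes that arrive during the interval into the latest value is an assumption.

```python
import time
from typing import Callable, Optional


class MetricAdvertiser:
    """Suppresses metric advertisements that arrive faster than min_interval."""

    def __init__(self, min_interval: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval
        self.clock = clock
        self.last_sent_at: Optional[float] = None
        self.last_sent_value: Optional[int] = None
        self.pending: Optional[int] = None

    def update(self, value: int) -> Optional[int]:
        """Record a new metric value; return it if it can be advertised now."""
        self.pending = value  # later changes overwrite earlier pending ones
        return self.poll()

    def poll(self) -> Optional[int]:
        """Return the value to advertise, or None while the interval holds."""
        if self.pending is None or self.pending == self.last_sent_value:
            self.pending = None  # nothing new to advertise
            return None
        now = self.clock()
        if (self.last_sent_at is not None
                and now - self.last_sent_at < self.min_interval):
            return None  # too soon; keep the pending value for a later poll
        self.last_sent_at = now
        self.last_sent_value = self.pending
        self.pending = None
        return self.last_sent_value
```

In this sketch, several changes inside one interval result in a single advertisement carrying the most recent value, which limits the number of BGP UPDATE messages without leaving the ingress routers with stale data for longer than the interval.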

In addition to the applicable BGP Error Handling procedure, a BGP speaker should ignore the Metadata Path Attribute if more than one Metadata Path Attribute is present in one BGP UPDATE message. The Metadata Path Attribute contains a sequence of Sub-TLVs. The Metadata Path Attribute's length minus 1 gives the total number of octets occupied by all the Sub-TLVs under the Metadata Path Attribute; that is, the sum of the lengths of all the Sub-TLVs plus 1 should equal the length of the Metadata Path Attribute. If this is not the case, the Metadata Path Attribute should be considered malformed, and the "Treat-as-withdraw" procedure is applied. When more than one Sub-TLV is present in a Metadata Path Attribute, the Sub-TLVs are processed independently. If a Metadata Path Attribute can be parsed correctly but contains a Sub-TLV whose type is not recognized by a particular BGP speaker, that BGP speaker MUST NOT consider the attribute malformed. Instead, it MUST interpret the attribute as if that Sub-TLV had not been present. Logging the error locally or to a management system is optional. If the route carrying the Metadata Path Attribute is propagated with the attribute, the unrecognized Sub-TLV remains in the attribute.
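The length check and the handling of unrecognized Sub-TLVs can be sketched as follows. The wire layout used here is an assumption made for illustration only: one leading octet in the attribute value, followed by Sub-TLVs that each have a 1-octet type and a 1-octet length covering the value. The type codes in KNOWN_TYPES and all function names are hypothetical; the encoding sections of this document are authoritative.

```python
from typing import FrozenSet, List, Optional, Tuple

KNOWN_TYPES: FrozenSet[int] = frozenset({1, 2, 3})  # hypothetical type codes


class Malformed(Exception):
    """Signals that the caller should apply the treat-as-withdraw procedure."""


def parse_metadata_attribute(
        value: bytes,
        known_types: FrozenSet[int] = KNOWN_TYPES) -> List[Tuple[int, bytes]]:
    """Return the recognized (type, value) Sub-TLVs of one attribute."""
    if len(value) < 1:
        raise Malformed("missing leading octet")
    body = value[1:]  # attribute length minus 1 octets hold the Sub-TLVs
    sub_tlvs: List[Tuple[int, bytes]] = []
    offset = 0
    while offset < len(body):
        if len(body) - offset < 2:
            raise Malformed("truncated Sub-TLV header")
        sub_type, sub_len = body[offset], body[offset + 1]
        end = offset + 2 + sub_len
        if end > len(body):
            raise Malformed("Sub-TLV lengths do not match attribute length")
        if sub_type in known_types:
            sub_tlvs.append((sub_type, body[offset + 2:end]))
        # Unrecognized types are skipped as if they had not been present.
        offset = end
    return sub_tlvs


def metadata_from_update(
        metadata_attrs: List[bytes]) -> Optional[List[Tuple[int, bytes]]]:
    """Ignore the attribute when an UPDATE carries more than one of them."""
    if len(metadata_attrs) != 1:
        return None
    return parse_metadata_attribute(metadata_attrs[0])
```

The parser never modifies the original octets, so an unrecognized Sub-TLV stays in the attribute if the route is propagated with it.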

The edge service Metadata described in this document are only intended for propagation between the ingress and egress routers of one single BGP domain, i.e., the 5G Local Data Network (LDN), which is a limited domain with edge services a few hops away from the ingress nodes. Only selected UE services are considered 5G edge services. The 5G LDN is usually managed by one operator, even though the routers can come from different vendors.

The proposed edge service Metadata are advertised within the trusted domain of the 5G LDN's ingress and egress routers. The ingress routers should not propagate the edge service Metadata to any nodes that are not within the trusted domain. To prevent the BGP UPDATE receivers (i.e., the ingress routers in this document) from accidentally leaking the Metadata Path Attribute to nodes outside the trusted domain, the following practices should be enforced:

The Metadata Path Attribute originator sets the attribute as Non-transitive when sending the BGP UPDATE message to its corresponding RR. Note that Non-transitive Path Attributes are only guaranteed to be dropped during BGP route propagation by implementations that do not recognize them.

The RR (Route Reflector) can append the NO_ADVERTISE well-known community to the BGP UPDATE message with the Metadata Path Attribute when forwarding it to the ingress routers. By doing so, the Route Reflector signals to the ingress nodes that the associated route's Metadata Path Attribute should not be advertised beyond their scope. This precaution ensures that the receiver of the BGP UPDATE message refrains from forwarding the received update to its peers, which prevents the undesired propagation of the information carried by the Metadata Path Attribute.

BGP Route Filtering or BGP Route Policies can also be used to ensure that BGP UPDATE messages with the Metadata Path Attribute attached are not forwarded out of the administrative domain. BGP route filtering allows network administrators to control the advertisement and acceptance of BGP routes, ensuring that specific routes do not leak outside the intended administrative domain. The steps to achieve this are:

Use Route Filtering: Implement route filtering policies on the ingress routers to restrict the propagation of BGP UPDATE messages for the registered 5G edge services beyond the administrative domain. Access control lists (ACLs), prefix lists, or route maps can be used to filter the BGP routes classified as 5G edge services, which need the Metadata Path Attributes to be distributed from egress routers to ingress routers.

Filter by Prefix: Use prefix filtering to specify which IP prefixes should be advertised to peers and which should be suppressed. This step ensures that only authorized routes are sent to external peers.

Use Route Maps: Route maps provide a flexible way to filter and manipulate BGP route advertisements. Route maps can be created to match specific conditions and then applied to the BGP configuration.

IANA is requested to assign a new path attribute from the "BGP Path Attributes" registry. The symbolic name of the attribute is "Metadata", and the reference is [This Document].

IANA is requested to create a new sub-registry under the Metadata Path Attribute registry as follows:

   Name: Sub-TLVs under the "Metadata Path Attribute"
   Registration Procedure: Expert Review. Detailed Expert Review procedure will be added.
   Reference: [this document]

Changwang Lin
New H3C Technologies
China
Email: linchangwang.04414@h3c.com

Acknowledgements to Jeff Haas, Tom Petch, Adrian Farrel, Alvaro Retana, Robert Raszuk, Sue Hares, Shunwan Zhuang, Donald Eastlake, Dhruv Dhody, Cheng Li, DongYu Yuan, and Vincent Shi for their suggestions and contributions.