CNI Telco-Cloud Benchmarking Considerations

fortiss GmbH

Guerickestr. 25 Munich 80805 DE samizadeh@fortiss.org

ATHENA RC

University Campus South Entrance Xanthi 67100 Greece George.Koukis@athenarc.gr

fortiss GmbH

Guerickestr. 25 Munich 80805 DE sofia@fortiss.org www.rutesofia.com

University of Macedonia

Egnatias 156 Thessaloniki 54636 Greece emamatas@uom.edu.gr

ATHENA RC

University Campus South Entrance Xanthi 67100 Greece vassilis.tsaoussidis@gmail.com

Operations and Management Area
Benchmarking Methodology Working Group
Internet-Draft
CNI, SDN, Edge-Cloud

This document investigates benchmarking methodologies for Kubernetes Container Network Interfaces (CNIs) in Edge-to-Cloud environments. It defines performance, scalability, and observability metrics relevant to CNIs, and aligns with the goals of the IETF Benchmarking Methodology Working Group (BMWG). The document surveys current practices, introduces a repeatable benchmarking framework (e.g., CODEF), and proposes a path toward standardized, vendor-neutral benchmarking procedures for evaluating CNIs in microservice-oriented, distributed infrastructures.

Introduction

This document presents an initial exploration of benchmarking methodologies for Kubernetes Container Network Interfaces (CNIs) in Edge-to-Cloud environments. It evaluates the performance characteristics of common Kubernetes networking plugins such as Multus, Calico, Cilium, and Flannel within the scope of container orchestration platforms. The draft aims to align with the principles of the IETF Benchmarking Methodology Working Group (BMWG) by proposing a framework for repeatable, comparable, and vendor-neutral benchmarking of CNIs. Emphasis is placed on performance aspects relevant to Software Defined Networking (SDN) architectures and distributed deployments. The goal is to inform the development of formal benchmarking procedures tailored to CNIs in heterogeneous infrastructure scenarios.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.

Problem Statement and Alignment with BMWG Goals

BMWG proposes and debates methodologies and metrics to evaluate performance characteristics of networking devices and systems in a repeatable, vendor-neutral, and interoperable manner. While multiple Kubernetes CNI solutions exist and are critical to Kubernetes networking and, by extension, to telco-cloud networking, there is currently no standardized methodology for benchmarking their performance, resource utilization, or behavior under varying operational conditions. The absence of such standards leads to non-reproducible, vendor-specific results that are difficult to compare or rely on for deployment decisions in edge-cloud contexts. This document aligns with BMWG goals by proposing benchmarking considerations for Kubernetes Container Network Interface (CNI) plugins that adhere to the following principles:

Repeatability and Reproducibility: The draft emphasizes deterministic test environments by leveraging clean-slate container orchestration through automation frameworks such as the experimental open-source Cognitive Decentralised Edge Cloud (CODECO) and the Experimentation Framework (CODEF). Test cases are repeatable across deployments, and variability in underlying infrastructure (e.g., bare metal vs. virtualized environments) is explicitly documented to preserve reproducibility, following BMWG best practices.
Vendor-Neutral Evaluation: The proposed approach includes a diverse set of CNIs from multiple vendors and open-source communities, avoiding platform-specific optimizations. CNIs are evaluated under the same environmental and workload conditions to provide fair comparisons, consistent with BMWG's commitment to vendor-agnostic test procedures.
Metrics-Based Assessment: The document adopts classical benchmarking metrics including latency, throughput, jitter, and resource consumption (CPU, memory), extending them with CNI-relevant attributes such as pod network initialization time and observability overhead. These metrics are aligned with the performance evaluation goals of earlier BMWG work and of more recent benchmarking efforts for virtualized environments.
Applicability to Emerging Architectures: The targeted environment includes Edge-to-Cloud deployments, which represent modern distributed system architectures. While BMWG has historically focused on network appliances, this work extends those principles to the networking aspects of containerized and software-defined infrastructures, continuing the evolution of benchmarking methods to address dynamic, microservice-based platforms.
Traffic and Control Plane Separation: Following BMWG precedent, the methodology distinguishes between control-plane operations (e.g., pod deployment and CNI setup latency) and data-plane behavior (e.g., packet forwarding performance), allowing comprehensive benchmarking of CNIs across operational dimensions.
Scalability and Stress Testing: The methodology incorporates stress and scalability scenarios, consistent with BMWG goals, to uncover performance degradation points and assess operational resilience of CNIs under heavy load and fault conditions.
Model Reference: CNIs in Kubernetes follow the models described in .

This alignment ensures that future extensions of this document toward a formal benchmarking specification can be scoped within the BMWG charter and contribute to standardized practices for container network evaluation.

Abbreviations

CNI: Container Network Interface
SUT: System Under Test
DUT: Device Under Test
SDN: Software Defined Networking
OVS: Open vSwitch
OVN: Open Virtual Network
RTT: Round-Trip Time
eBPF: Extended Berkeley Packet Filter
ENI: Elastic Network Interface
QoE: Quality of Experience

Scope of Metrics

The core benchmarking metrics in this document, such as latency, throughput, jitter, packet loss, and pod lifecycle time, are aligned with BMWG practices. Additional metrics such as resource usage, energy efficiency, and operational ease are included to reflect real-world operator concerns but are considered informational and outside the core BMWG scope.

CNI Benchmarking Key Aspects

While several performance-benchmarking suites are already available from CNI providers, the open-source community, and the IETF BMWG, a comprehensive CNI evaluation SHOULD incorporate relevant performance metrics, cover scalability aspects, and identify bottlenecks. This section outlines the aspects needed for reliable and replicable performance evaluation from a telco-cloud perspective.

Core Performance Metrics for CNI Benchmarking

In microservice-based architectures, microservices interact with each other and with external services. Because Kubernetes does not implement pod networking itself and delegates it to CNI plugins, communication and networking remain a continuous concern for containerized applications and orchestration platforms. Moreover, communication between containers is critical to meeting the QoS requirements of applications. Several metrics should be taken into account when evaluating CNI performance, including network throughput, end-to-end latency, pod setup and deletion times, and CPU and memory utilization. This section defines the core benchmarking metrics used to assess the performance of Container Network Interface (CNI) plugins in Kubernetes environments. The metrics conform to the standard BMWG benchmarking framework and are extended where necessary to include container-specific control-plane considerations. Measurements MUST be conducted under controlled conditions as described in Section 8, and SHOULD include both steady-state and dynamic workloads.

Data Plane Performance Metrics

Benchmarking Quality of Service (QoS) for CNI plugins typically focuses on traditional performance metrics such as one-way latency, round-trip delay, packet loss, jitter, and achievable data rates under varied network conditions. These metrics are fundamental to assessing the efficiency and responsiveness of a CNI in both intra-cluster and inter-cluster communication scenarios. To ensure comprehensive evaluation, the benchmarking methodology SHOULD include tests using multiple transport protocols, primarily TCP and UDP. This is essential, as CNI plugins may exhibit significantly different performance profiles depending on the protocol type due to variations in connection setup, flow control, and packet processing overhead. For TCP, two key test modes are RECOMMENDED:

TCP_RR (Request/Response): Measures the rate at which application-layer request/response pairs can be exchanged over a persistent TCP connection. This reflects transaction latency under connection reuse scenarios.
TCP_CRR (Connect/Request/Response): Assesses the rate at which new TCP connections can be established, used for a request/response exchange, and torn down. This test exposes connection setup overhead and potential scalability bottlenecks.

For UDP, the benchmark SHOULD include UDP_RR testing, which captures round-trip time (RTT), latency variation (jitter), and packet loss characteristics under lightweight, connectionless exchanges. In all tests, the benchmarking suite MUST include a representative range of payload sizes, including at least 64 bytes, 512 bytes, and 1500 bytes. If supported by the underlying network and CNI plugin, jumbo frames (e.g., MTU > 1500 bytes) SHOULD also be tested to expose potential fragmentation penalties and their impact on latency, jitter, and throughput. These metrics evaluate the efficiency of packet forwarding and transport under varying traffic patterns, and are REQUIRED:

One-Way Latency (ms) SHOULD be measured using timestamped probes.
RTT (ms) SHOULD be measured via TCP_RR, TCP_CRR, and UDP_RR test modes.
Throughput (Mbps or Gbps) SHOULD be assessed as the highest sustained rate of successful packet delivery for the CNI without packet loss.
Packet loss rate (%) SHOULD be measured to assess the reliability and congestion tolerance of the CNI.
Jitter MAY be relevant to assess variability. High jitter may indicate queuing inefficiencies or variable path latency.
Packet size variability SHALL be evaluated using a representative set of frame sizes (64B, 512B, 1500B). If jumbo frames (>1500B) are supported, testing MUST include these cases to expose fragmentation overheads.
Concurrent flow handling SHOULD be measured using concurrent connections and sustained request/response patterns for both TCP and UDP.
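As a minimal sketch of how the data-plane metrics listed above might be summarized from raw probe results, the following Python fragment computes mean RTT, jitter, and packet loss rate. The function name summarize_rtt, the input layout (a list of RTT samples in milliseconds, with None marking a lost probe), and the jitter definition (mean absolute difference between consecutive successful samples) are assumptions made for illustration; none of them is defined by this document.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_rtt(samples_ms: Sequence[Optional[float]]) -> dict:
    """Summarize RTT probes; None marks a probe that received no reply."""
    if not samples_ms:
        raise ValueError("no probes were sent")
    received = [s for s in samples_ms if s is not None]
    loss_pct = 100.0 * (len(samples_ms) - len(received)) / len(samples_ms)
    # Illustrative jitter definition (an assumption, see the lead-in):
    # mean absolute difference between consecutive successful samples.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "rtt_mean_ms": mean(received) if received else None,
        "jitter_ms": mean(deltas) if deltas else 0.0,
        "loss_pct": loss_pct,
    }


if __name__ == "__main__":
    # Hypothetical samples: four probes, one of them lost.
    print(summarize_rtt([1.0, 2.0, None, 1.0]))
```

A report would typically repeat such a summary per test mode (TCP_RR, TCP_CRR, UDP_RR) and per payload size, so that results remain comparable across CNIs.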

Control Plane Performance Metrics

These metrics evaluate the responsiveness of the CNI plugin and Kubernetes components during pod and network lifecycle operations and are REQUIRED:

Pod initialization time (s) SHOULD be measured from kubelet interaction to completion of the CNI ADD operation.
Pod deletion time (s) SHOULD be measured to understand issues with teardown.
CNI plugin deployment time (s) SHOULD be assessed to understand the time required for each CNI plugin to be fully deployed across all cluster nodes.
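The control-plane timings above reduce to differences between pairs of timestamps. The sketch below, in Python, assumes that the start event and the completion of the CNI ADD operation have already been extracted as ISO 8601 strings; how they are obtained (for example from kubelet or CNI logs) is outside the scope of this sketch, and the names parse_ts and init_times_s are hypothetical.

```python
from datetime import datetime
from statistics import median


def parse_ts(ts: str) -> datetime:
    # Map a trailing "Z" (UTC) to an explicit offset, because older
    # Python versions reject "Z" in datetime.fromisoformat().
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def init_times_s(events):
    """events: iterable of (pod_name, start_ts, cni_add_done_ts) strings.

    Returns a mapping from pod name to initialization time in seconds.
    """
    return {
        pod: (parse_ts(done) - parse_ts(start)).total_seconds()
        for pod, start, done in events
    }


if __name__ == "__main__":
    # Hypothetical timestamps for three pods.
    events = [
        ("pod-a", "2025-01-01T10:00:00Z", "2025-01-01T10:00:01.500Z"),
        ("pod-b", "2025-01-01T10:00:00Z", "2025-01-01T10:00:02Z"),
        ("pod-c", "2025-01-01T10:00:00Z", "2025-01-01T10:00:00.500Z"),
    ]
    times = init_times_s(events)
    print("median:", median(times.values()), "max:", max(times.values()))
```

Reporting the median together with the maximum helps separate typical setup latency from outliers caused by pod-creation bursts.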

System Resource Performance Metrics

These metrics are essential in resource-constrained environments (e.g., edge deployments) where efficiency impacts scalability and are RECOMMENDED:

CPU/GPU utilization SHOULD be reported per node and per CNI process.
Memory utilization (MB/GB) measurements MUST consider average and peak memory used by the CNI.
CNIs SHOULD be evaluated under varying load conditions (idle, low-traffic, high traffic).

The CPU and memory footprint of a Container Network Interface (CNI) plugin has substantial implications for workload density and system scalability, especially in resource-constrained or heterogeneous environments. In modern Edge-to-Cloud deployments, which often comprise diverse processor architectures (e.g., ARM64, AMD64) and variable memory constraints, resource efficiency is critical to maximizing node utilization and sustaining performance.

The architectural design of a CNI directly affects its resource profile. CNIs with extensive feature sets and complex data-plane capabilities, such as policy enforcement, encryption, overlay encapsulation (e.g., VXLAN, IP-in-IP), or eBPF/XDP acceleration, tend to exhibit higher CPU and memory consumption. For example, CNIs that perform user-space packet processing typically incur higher overhead, as each packet traverses the kernel-user boundary multiple times, resulting in increased CPU cycles and memory copies. In contrast, in-kernel eBPF-based processing can reduce such overhead by executing directly in the Linux kernel. In cloud-native deployments, CNIs that manage external interfaces (e.g., Elastic Network Interfaces (ENIs) in public cloud environments) may also introduce persistent memory usage due to API caching, state tracking, and metadata management.

This variability is further amplified under dynamic workloads. It is frequently observed that a CNI optimized for high-throughput TCP bulk traffic may perform suboptimally under UDP-heavy traffic, high pod churn, or policy-intensive workloads. These behavioral differences necessitate a systematic and multi-dimensional benchmarking approach. Accordingly, a robust benchmarking methodology SHOULD assess each CNI under at least three operating states: idle, low traffic (low load), and high traffic (high load). Such profiling enables the identification of baseline resource usage, saturation thresholds, and degradation points ("performance peaks"). Measurements SHOULD be taken at both the node level (e.g., using Prometheus) and at the container or pod level (e.g., using cAdvisor). These practices are consistent with recommendations for virtualized and cloud-native benchmarking environments.
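The three-state profiling described above can be illustrated with a short Python sketch that groups resource samples by operating state and reports average CPU, average memory, and peak memory. The sample layout (state, CPU percent, memory in MB) and the function name profile are assumptions for illustration; in practice the samples would come from whatever monitoring stack is in use.

```python
from statistics import mean


def profile(samples):
    """samples: iterable of (state, cpu_pct, mem_mb) tuples.

    Returns per-state average CPU, average memory, and peak memory.
    """
    by_state = {}
    for state, cpu, mem in samples:
        by_state.setdefault(state, []).append((cpu, mem))
    report = {}
    for state, rows in by_state.items():
        cpus = [c for c, _ in rows]
        mems = [m for _, m in rows]
        report[state] = {
            "cpu_avg_pct": mean(cpus),
            "mem_avg_mb": mean(mems),
            "mem_peak_mb": max(mems),
        }
    return report


if __name__ == "__main__":
    # Hypothetical samples for one CNI process.
    samples = [
        ("idle", 2, 100), ("idle", 4, 110),
        ("low", 10, 150), ("low", 14, 170),
        ("high", 60, 300), ("high", 80, 340),
    ]
    for state, stats in profile(samples).items():
        print(state, stats)
```

Comparing the idle row against the high-traffic row gives a first indication of baseline footprint versus load-dependent growth.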

Extended Performance Metrics (Optional)

While outside the core BMWG scope, these metrics reflect real-world operator needs and may be included for extended analysis, in particular for heterogeneous and resource-constrained edge-cloud scenarios. As such, the following metrics are RECOMMENDED:

Policy enforcement delay (ms).
Telemetry overhead.
Power and energy consumption (J per bit). Where applicable, node- or pod-level energy usage MAY be reported using tools such as Kepler. Results SHOULD include error margins arising from estimation variance or from the energy models used.

While not core to BMWG benchmarking, and currently non-normative, energy metrics MAY be collected where relevant. Tools such as Kepler MAY be used, but results SHOULD be accompanied by a disclaimer about accuracy limitations in virtualized environments and about issues related to the applied energy models. A related discussion on energy metrics and energy-sensitivity can be found in IETF GREEN, in the IRTF NMRG, and in IRTF SUSTAIN.
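The energy-per-bit metric and its error margin can be made concrete with a small worked example. The Python sketch below divides an energy estimate by the number of bits transferred and applies a symmetric relative error; the function name joules_per_bit and the idea of expressing the margin as a single relative error are assumptions for illustration, not a prescribed reporting format.

```python
def joules_per_bit(energy_j: float, bytes_transferred: int,
                   rel_error: float = 0.0):
    """Return (value, lower, upper) in joules per bit.

    rel_error is the assumed relative uncertainty of the energy
    estimate (e.g., 0.1 for +/-10%); it is supplied by the tester.
    """
    bits = bytes_transferred * 8
    if bits <= 0:
        raise ValueError("bytes_transferred must be positive")
    value = energy_j / bits
    margin = value * rel_error
    return value, value - margin, value + margin


if __name__ == "__main__":
    # Hypothetical run: 8 J consumed while transferring 1 MB,
    # with an assumed +/-10% uncertainty on the energy estimate.
    value, low, high = joules_per_bit(8.0, 1_000_000, rel_error=0.1)
    print(f"{value:.2e} J/bit (range {low:.2e} to {high:.2e})")
```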

Extended Quality of Experience for DevOps and Developers (Optional)

Quality of Experience (QoE) benchmarking for Container Network Interface (CNI) plugins extends beyond conventional network performance metrics such as latency and throughput. It focuses on assessing operational usability, deployment efficiency, and portability, i.e., factors that directly affect the user experience of platform administrators, DevOps engineers, and developers. For instance, time to deploy or configure the CNI, ease of troubleshooting, and impact of the CNI on application performance are examples of QoE parameters. Key QoE indicators MAY include:

Deployment time, the time required to install or upgrade a CNI plugin using declarative tooling (e.g., Helm charts, YAML manifests).
Configuration simplicity, the extent to which configuration is automated, validated, and integrated with Kubernetes-native workflows.
Troubleshooting tooling, the presence of purpose-built CLI utilities that simplify diagnostics, expose internal CNI state, and reduce reliance on low-level log inspection or manual kubectl commands.

For example, CNI-specific command-line interfaces such as cilium and calicoctl provide capabilities such as one-command installation, real-time policy and connectivity status, and automated diagnostics. The cilium status --verbose command provides IPAM allocations, agent health, and datapath metrics, while the calicoctl node diags command generates complete diagnostic bundles for analysis. CNI integration with Kubernetes distribution CLIs (e.g., k3s, MicroK8s) further improves QoE by streamlining lifecycle operations. For instance, MicroK8s leverages snap-based add-ons that can enable or disable CNIs via a single command, reducing complexity and configuration drift. Although these attributes are not part of the core benchmarking metrics defined by BMWG, their inclusion is RECOMMENDED to reflect practical DevOps concerns and enhance the applicability of CNI benchmarking results in production environments.

Interoperability and Scalability

To ensure comprehensive benchmarking coverage, scalability and stress-testing phases SHOULD be incorporated into the evaluation methodology. These phases are essential to identify the performance ceilings of a given CNI plugin and to assess its behavior under saturation conditions, including whether key observability features remain functional. Such assessments are consistent with BMWG guidance and extend the benchmarking scope beyond nominal operation to failure and recovery modes.

Stress tests SHOULD simulate high-load scenarios by concurrently scaling multiple Kubernetes components. This includes initiating rapid pod-creation bursts, deploying multiple concurrent services and network policies, and triggering controlled resource exhaustion events (e.g., CPU throttling, memory pressure, disk I/O contention). Furthermore, network impairments such as increased latency, jitter, or packet loss SHOULD be introduced using network emulation tools to assess the CNI's robustness under adverse network conditions. The use of orchestration tools such as Kube-Burner and chaos engineering frameworks (e.g., Chaos Mesh or Litmus) is RECOMMENDED to coordinate scalable and repeatable test scenarios. Network performance metrics during stress tests MAY be collected with traffic generators such as iperf3, netperf, or k6. Benchmark results SHOULD include degradation thresholds, error rates, recovery latency, and metrics export consistency under stress to support the evaluation of CNI resilience and operational observability.
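Two of the reported quantities above, the degradation threshold and the recovery latency, can be derived from measurement series with very little code. The Python sketch below is one possible interpretation under stated assumptions: degradation is declared when throughput falls below a fraction (tolerance) of the best value seen at lower load, and recovery is the time from fault injection until throughput first returns to that fraction of a baseline after having dropped. The function names, the tolerance of 0.9, and the sample data are all hypothetical.

```python
def degradation_point(results, tolerance=0.9):
    """results: list of (load_level, throughput), sorted by load_level.

    Returns the first load level whose throughput is below
    tolerance * the best throughput seen at lower load, else None.
    """
    best = 0.0
    for load, tput in results:
        if best and tput < tolerance * best:
            return load
        best = max(best, tput)
    return None


def recovery_latency(series, fault_time, baseline, tolerance=0.9):
    """series: list of (timestamp_s, throughput), sorted by time.

    Returns seconds from fault_time until throughput is back at or
    above tolerance * baseline after a drop, or None if no recovery.
    """
    degraded = False
    for t, tput in series:
        if t < fault_time:
            continue
        if tput < tolerance * baseline:
            degraded = True
        elif degraded:
            return t - fault_time
    return None


if __name__ == "__main__":
    # Hypothetical sweep over concurrent connections (Gbps).
    sweep = [(10, 9.0), (50, 9.4), (100, 9.3), (200, 7.0)]
    print("degradation at load:", degradation_point(sweep))
    # Hypothetical trace around a fault injected at t = 1.5 s.
    trace = [(0, 10), (1, 10), (2, 3), (3, 4), (4, 9.5), (5, 10)]
    print("recovery latency (s):", recovery_latency(trace, 1.5, 10))
```

The tolerance value materially changes both results, so it would need to be stated alongside any reported threshold.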

Observability and Bottleneck Detection

Observability is critical in identifying performance bottlenecks that may arise due to CNI behavior under stress conditions. Benchmarking SHOULD assess the ability of CNIs to expose metrics such as packet drops, queue lengths, or flow counts through standard telemetry interfaces (e.g., Prometheus, OpenTelemetry). Effective bottleneck detection tools and visibility into the data path are essential for root cause analysis. CNIs that provide native observability tooling (e.g., Cilium Hubble) SHOULD be benchmarked for the overhead and fidelity of these features.

In federated or multi-cluster environments, observability becomes a distributed operation spanning multiple control and data planes. Benchmarking MUST therefore evaluate how CNIs and associated telemetry systems aggregate, synchronize, and correlate metrics across clusters. This includes measuring propagation delays, timestamp alignment, and aggregation accuracy when telemetry data flows through federated collectors and monitoring backends (e.g., Prometheus-Thanos, Cortex). Benchmarks SHOULD also assess the ability to localize inter-cluster bottlenecks such as congested tunnels, gateway saturation, or asymmetric routing, distinguishing local from cross-cluster traffic degradation.
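As a rough illustration of the federated observability checks described above, the following Python sketch computes two simple quantities: the relative error between a federated aggregate and the sum of the per-cluster values it is supposed to represent, and the propagation delay between a metric being emitted in a cluster and becoming visible at the federated backend. Both function names and the input layouts are assumptions, and the delay calculation presumes that the clocks involved are synchronized.

```python
def aggregation_error_pct(per_cluster: dict, aggregated: float) -> float:
    """Relative error (%) of a federated aggregate versus the sum of
    the per-cluster counter values."""
    expected = sum(per_cluster.values())
    if expected == 0:
        raise ValueError("per-cluster values sum to zero")
    return 100.0 * abs(aggregated - expected) / expected


def propagation_delays_s(samples):
    """samples: iterable of (emitted_at_s, visible_at_s) pairs taken
    from synchronized clocks. Returns the per-sample delays."""
    return [visible - emitted for emitted, visible in samples]


if __name__ == "__main__":
    # Hypothetical packet-drop counters from two clusters and the
    # value reported by a federated monitoring backend.
    print(aggregation_error_pct({"cluster-a": 100, "cluster-b": 300}, 392))
    print(propagation_delays_s([(0.0, 1.5), (10.0, 12.0)]))
```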

Kubernetes CNI topologies

Kubernetes CNI topologies refer to the patterns of network connectivity in a Kubernetes environment that are used for testing or benchmarking CNIs:

Highly-coupled container-to-container communications
Pod-to-Pod communications
Pod-to-Service communications
External-to-Service communications

The benchmarking network topology must operate as an isolated test environment and MUST NOT connect to any devices that could forward test traffic into a production network or incorrectly route it to the test management network.

CNI Behavior in Federated and Multi-Cluster Environments

While existing works provide benchmarking methodologies for virtualized and containerized infrastructures, their scope does not extend to CNI behavior in multi-cluster or federated deployments. Related architectural drafts discuss aspects of multi-cluster operations and security, but do not specify CNI-focused, measurable performance parameters and considerations. Similarly, existing work that introduces the notion of multi-cluster service deployment and intent-based interconnection does not cover CNI-level performance benchmarking across federated clusters.

Overview of Federated Networking

Federated and multi-cluster environments extend the scope of container networking beyond single operational domains. These architectures enable scalability, geographical distribution, isolation, and service proximity to end users, which are key properties for multi-domain cloud-native infrastructures. Federated CNI benchmarking is particularly relevant to Telco-Cloud and 6G scenarios, where workloads are distributed between cloud and (far-)edge IoT domains, introducing additional considerations compared to single-cluster deployments.

In such environments, multiple clusters operate as autonomous domains while being interconnected through federation layers or multi-cluster networking mechanisms. Examples include popular third-party solutions such as Submariner, Liqo, Karmada, and Open Cluster Management (OCM), which provide network connectivity, service discovery, and workload scheduling across clusters. In this context, CNIs are often extended by multi-cluster gateways or overlays to facilitate inter-cluster pod-to-pod and service-to-service communication. Such interconnections can rely on encapsulation protocols (e.g., VXLAN, IPsec, WireGuard) or Layer-7 service meshes (e.g., Istio, Linkerd, Consul, Open Service Mesh), which are typically built on sidecar proxies such as Envoy.

Benchmarking Considerations for CNIs in Federated Environments

Benchmarking CNIs in federated deployments MUST explicitly reflect how (i) architectural choices, (ii) topology and connectivity, (iii) overlay and tunneling mechanisms, (iv) synchronization, and (v) security enforcement affect network behavior for both data-plane and control-plane operations. The following factors are key:

Federation and Topology Models: CNIs may operate under hub-and-spoke, neighboring, full-mesh, or hierarchical topologies. Each model introduces distinct path lengths, potential bottlenecks, and security concerns. Benchmarks SHOULD quantify metrics such as latency, jitter, and packet loss across these models.
Overlay, Encapsulation, and Encryption Mechanisms: CNIs may rely on native multi-cluster extensions (e.g., Cilium ClusterMesh) or external overlays (e.g., Submariner tunnels) with optional encryption (e.g., IPsec, WireGuard). Tests SHOULD measure the combined encapsulation and cryptographic overhead, including per-packet header size, MTU effects, CPU utilization, and throughput reduction compared to unencrypted baselines.
Routing, Policy, and Synchronization Behavior: CNIs synchronize endpoints, routes, and network policies across clusters. Benchmarking SHOULD measure propagation delay, convergence time, and consistency under dynamic conditions such as node joins, removals, or policy updates. Resource utilization (CPU, memory, and bandwidth) during synchronization SHOULD also be recorded.
Cross-Cluster Connectivity and Load Balancing: Evaluation SHOULD include one-way and RTT latency, throughput, and packet loss between pods located in different clusters. When multi-cluster services distribute requests, benchmarks SHOULD assess fairness as well as responsiveness to endpoint or cluster failures that influence path selection and recovery behavior.
Quality of Service (QoS) and Policy Enforcement: CNIs that implement QoS tagging or traffic shaping (e.g., Cilium's eBPF/EDT-based pacing, Calico's DSCP marking and policy-driven shaping, Antrea's TrafficControl, or Kube-OVN's QoS queues) SHOULD be evaluated for their ability to maintain SLA/SLO across clusters and overlays. Benchmarks SHOULD also verify that isolation and access-control policies (e.g., deny/allow rules) remain consistent across domains.
Resiliency and Recovery Performance: Benchmarking SHOULD assess CNI behavior during multi-cluster fault conditions, including inter-cluster link loss, control-plane failures, restarts, or topology reconfiguration. Measurements SHOULD include reconvergence time, packet loss, and recovery time to steady-state. Benchmarks SHOULD also evaluate route re-establishment latency and transient traffic interruption duration to characterize the CNI's overall fault-tolerance behavior.
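The encapsulation-overhead item in the list above lends itself to a small worked calculation. The Python sketch below estimates the inner MTU and the share of each full-size underlay packet that carries TCP payload for a few tunnel types, and computes throughput reduction against a baseline. The per-packet overhead figures are commonly cited values for an IPv4 underlay and are stated here as assumptions to be verified for the actual environment; the function names are hypothetical.

```python
# Assumed per-packet overhead in bytes for an IPv4 underlay:
# IP-in-IP adds an outer IPv4 header (20); VXLAN adds outer IPv4 (20),
# UDP (8), VXLAN (8), and the inner Ethernet header (14); WireGuard
# adds outer IPv4 (20), UDP (8), and its own header and tag (32).
OVERHEAD_BYTES = {"none": 0, "ipip": 20, "vxlan": 50, "wireguard": 60}


def inner_mtu(underlay_mtu: int, encap: str) -> int:
    """MTU available to pods once tunnel overhead is subtracted."""
    return underlay_mtu - OVERHEAD_BYTES[encap]


def tcp_payload_ratio(underlay_mtu: int, encap: str) -> float:
    """Fraction of a full-size underlay IP packet that is TCP payload,
    assuming inner IPv4 and TCP headers without options (40 bytes)."""
    return (inner_mtu(underlay_mtu, encap) - 40) / underlay_mtu


def throughput_reduction_pct(baseline: float, measured: float) -> float:
    """Throughput loss (%) relative to an unencapsulated baseline."""
    return 100.0 * (baseline - measured) / baseline


if __name__ == "__main__":
    for encap in OVERHEAD_BYTES:
        print(encap, inner_mtu(1500, encap),
              round(tcp_payload_ratio(1500, encap), 4))
    # Hypothetical measurements in Gbps.
    print(throughput_reduction_pct(10.0, 8.0))
```

Header overhead alone accounts for only a few percent of the payload ratio at a 1500-byte MTU, so larger measured reductions would point to CPU cost (for example cryptographic processing) rather than framing.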

Best Practice Operational Example: CODEF

CODEF is an open-source, modular benchmarking environment that supports the evaluation of containerized workloads in edge-to-cloud infrastructures. CODEF adopts a microservice-based architecture to streamline experimentation through abstraction, automation, and reproducibility. CODEF is logically divided into four functional layers, each implemented as an independent containerized microservice: Infrastructure Manager, Resource Manager, Experiment Controller, and Results' Processor, as represented in Figure 1. This modular design ensures extensibility and facilitates integration with diverse technologies across the experimentation pipeline.

    +---------------+      +-------------------+
    | Infrastr Mgrs |--->  | physical,VM,cloud |
    +---------------+      +-------------------+
            |
            | Deploy Resource Managers per node
            |
            |         Containers
            |      +---------------+    +----------+
            |----> | Resource MgrA |<-->| Master   |     SW / App
            |      +---------------+    +----------+    +---------+
            |----> | Resource MgrB |<-->| Worker1  |<-->| Ansible |
            |      +---------------+    +----------+    +---------+
            |----> | Resource MgrC |<-->| WorkerX  |
            |      +---------------+    +----------+
            |
            |   Container
            | Execute Exper  +----------------+    +------------+
            +------------->  | Experiment Ctr |<-->| Iteration, |
            |                +----------------+    | Metrics    |
            |                                      +------------+
            |   Container
            | Output Results +-------------------+    +-------------+
            +------------->  | Results Processor |<-->| Processing, |
                             +-------------------+    | Stats, LaTeX|
                                                      +-------------+

Figure 1: CODEF and its components.

The Infrastructure Manager layer provisions cluster resources across heterogeneous environments, including bare-metal nodes, hypervisor-based virtual machines (e.g., VirtualBox, XCP-ng), and public or academic cloud testbeds (e.g., AWS, CloudLab, EdgeNet).
The Resource Manager deploys software components on each node using parameterized Ansible playbooks. A dedicated instance of the Resource Manager operates per node to guarantee consistent, automated software setup.
The Experiment Controller coordinates workload execution, manages experimental iterations, collects measurement data, and invokes benchmarks.
The Results' Processor performs statistical analysis and post-processing to generate structured outputs, including visualization and reporting artifacts.

CODEF supports full automation of the experimentation lifecycle, from cluster instantiation to metric analysis. Each cluster is provisioned from clean operating system images to ensure consistency, repeatability, and environmental isolation across benchmark runs. This approach eliminates state leakage between tests and enhances comparability. The framework also provides low-level parameterization options for various networking and security configurations. These include tunneling and encapsulation mechanisms (e.g., VXLAN, Geneve, IP-in-IP), encryption protocols (e.g., IPsec, WireGuard), and Linux kernel-based datapath acceleration features (e.g., eBPF and XDP). Such flexibility supports the emulation of production-grade deployments across a wide range of container network interfaces (CNIs) and infrastructure types.

CODEF Benchmarking and CNI Support

CODEF addresses the need for repeatable, infrastructure-agnostic benchmarking across the edge-to-cloud continuum. It supports a broad spectrum of third-party CNI plugins, including Antrea, Calico, Cilium, Flannel, Weave Net, Kube-Router, Kube-OVN, and Multus, as well as emerging solutions such as L2S-M. These CNIs can be deployed and benchmarked across multiple Kubernetes distributions, including upstream Kubernetes (vanilla), lightweight variants such as K3s, K0s, and MicroK8s, and production-grade clusters.

Each CNI plugin employs distinct architectural strategies at the network layer, such as underlay versus overlay models, use of encapsulation protocols (e.g., VXLAN, Geneve), encryption mechanisms (e.g., WireGuard, IPsec), and programmable datapaths (e.g., eBPF/XDP). Additionally, the degree of support for network policy enforcement, observability, and integration with Kubernetes-native APIs varies significantly across implementations. These differences introduce variability in performance, scalability, and resource utilization depending on workload and deployment characteristics.

CODEF enables the consistent application of benchmarking procedures across this heterogeneity by offering a unified, declarative methodology. It abstracts infrastructure-specific details and enforces environmental consistency through repeatable provisioning, workload orchestration, and result normalization. Accordingly, any benchmarking methodology targeting CNIs in diverse Kubernetes environments SHOULD account for these dimensions: CNI architecture, Kubernetes distribution, infrastructure type, and test scenario configuration, to ensure meaningful, comparable, and reproducible results.
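The four dimensions named above can be expanded into an explicit test matrix so that every CNI is exercised under the same combinations. The following Python sketch is illustrative only and does not reflect how CODEF itself is implemented or configured; the dictionary layout, the chosen values, and the function name build_matrix are assumptions.

```python
from itertools import product

# Illustrative values drawn from the examples in this document; a real
# campaign would list the CNIs and environments actually under test.
DIMENSIONS = {
    "cni": ["calico", "cilium", "flannel"],
    "distribution": ["kubernetes", "k3s", "microk8s"],
    "infrastructure": ["bare-metal", "vm"],
    "scenario": ["intra-node", "inter-node"],
}


def build_matrix(dimensions: dict) -> list:
    """Return one dict per combination of the given dimension values."""
    keys = list(dimensions)
    return [
        dict(zip(keys, combo))
        for combo in product(*(dimensions[k] for k in keys))
    ]


if __name__ == "__main__":
    matrix = build_matrix(DIMENSIONS)
    print(len(matrix), "test cases; first:", matrix[0])
```

Enumerating the matrix up front also makes it easy to record which combinations were skipped, which supports the comparability goals stated above.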

Environment Configuration Aspects

In addition to the functional differences among CNI plugin implementations, benchmarking methodologies SHOULD account for the architectural and physical characteristics of the deployment environment. Key variables include the type of infrastructure, such as virtualized environments (e.g., VM or hypervisor-based) versus bare-metal deployments, and the test topology, including intra-node (same host) versus inter-node (across hosts) communication. Benchmarks SHOULD also distinguish between distributions designed for general-purpose Kubernetes (e.g., vanilla K8s) and those optimized for constrained edge deployments (e.g., MicroK8s, K3s).

Hardware heterogeneity introduces further variability. Performance results can be significantly influenced by CPU architecture (e.g., x86_64 vs. ARM), number of cores and threads, memory speed and hierarchy, cache layout, NUMA topology, and network interface characteristics (e.g., NIC model, offload capabilities, and firmware version). Low-level system configuration options, including MTU size, tunneling mode (e.g., VXLAN, IP-in-IP), and kernel datapath tuning (e.g., eBPF or XDP parameters), MAY also affect observed performance.

Empirical results from experiments conducted with CODEF under a variety of scenarios, including intra- and inter-cluster configurations, hardware with diverse specifications, and a range of Kubernetes distributions, demonstrated measurable performance differences across CNI plugins. Notably, significant disparities were observed not only between different CNI implementations, but also within the same CNI when deployed on different Kubernetes distributions or system architectures. Contrary to expectation, deploying lightweight CNI plugins on edge-optimized distributions does not always result in improved efficiency. In some cases, plugins reduce their resource footprint by sacrificing performance (e.g., selecting a simpler encapsulation mechanism), while others achieve better throughput when paired with more capable general-purpose distributions at the expense of increased overhead. These trade-offs SHOULD be explicitly captured in benchmarking outcomes.

Importantly, the optimal CNI and distribution pairing is often workload-dependent. A configuration that appears suboptimal in terms of raw resource usage MAY outperform a lightweight alternative for certain traffic patterns, application behaviors, or network policies. As such, benchmarking methodologies intended for heterogeneous edge-cloud scenarios, in particular mobile and IoT scenarios where embedded devices are a main part of the overall networking infrastructure, SHOULD incorporate these dimensions and evaluate plugin behavior across representative workloads and system conditions.

Measurement Tools CODEF relies on Ansible playbooks to provision a suite of software tools supporting both workload generation and measurement. Benchmarking configurations may include lightweight and comprehensive traffic generators such as , , and , as well as the . These tools enable detailed measurements of network bandwidth, packet throughput, latency, and fragmentation behavior across TCP and UDP protocols, with varying message sizes. Resource usage metrics such as CPU load, memory consumption, and disk utilization are collected at both node and container granularity. Observability stacks based on Prometheus and Grafana are integrated for real-time metric capture, historical trend visualization, and alerting capabilities. These facilities support traceability of system behavior during experiments and assist in identifying anomalous performance characteristics. For scalability and resilience benchmarking, CODEF integrates load and stress testing tools such as the CNCF and chaos engineering platforms (e.g., Chaos Mesh or Litmus). These tools simulate dynamic workloads, rapid pod scaling, and fault injection to evaluate system performance under adverse or bursty conditions. Such orchestrated testing scenarios are essential to reveal bottlenecks, performance degradation points, and recovery latency under operational stress. Power consumption profiling is optionally supported through empirical estimation models or telemetry-based measurement frameworks such as . However, their accuracy SHOULD be evaluated critically, as results may vary depending on the availability and quality of hardware-level counters (e.g., Intel RAPL) and the characteristics of the execution platform, particularly in virtualized or non-Intel environments.

Kubernetes CNI Benchmarking Telco-Cloud Methodology This section defines a set of best practice guidelines for benchmarking Kubernetes CNI plugins in telco-cloud and edge-cloud environments. The approach is aligned with IETF BMWG principles, emphasizing reproducibility, transparency, and comparability. The benchmarking recommendations presented herein aim to be applicable across a wide range of deployment scenarios, Kubernetes distributions, and CNI implementations. While selected operational workflows and experiences from CODEF are used to illustrate practical implementation of these best practices, the methodology itself is designed to remain tool-agnostic and aligned with standardized benchmarking guidance. The practices focus on controlled environment setup, test repeatability, performance metric collection, observability, and result reporting. Attention is given to characteristics relevant to telco and edge environments, including resource constraints, deployment diversity, and protocol behavior under stress. The goal is to provide a consistent and extensible benchmarking methodology for CNIs operating in dynamic, distributed, and microservice-oriented infrastructure environments.

Controlled Test Environments Benchmarking SHOULD be conducted in isolated testbeds with no extraneous traffic or workloads. The following practices help reduce environmental noise and increase determinism:

Use bare-metal hosts or dedicated VMs for benchmarking to avoid cross-tenant interference.
Ensure consistent CPU pinning and disable power-saving features or CPU frequency scaling to stabilize performance measurements.
Synchronize clocks across test nodes using NTP or PTP for accurate latency and jitter measurement.
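
As an illustration of the CPU frequency scaling practice above, the following minimal sketch reports which CPUs on a Linux test node are not using an expected frequency-scaling governor. It assumes the standard Linux cpufreq sysfs layout; the function names and the choice of "performance" as the expected governor are illustrative assumptions, not part of this methodology.

```python
from pathlib import Path


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes a cpufreq governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def non_conforming_cpus(governors, expected="performance"):
    """List CPUs whose governor differs from the expected one (assumed value)."""
    return sorted(cpu for cpu, gov in governors.items() if gov != expected)


if __name__ == "__main__":
    offenders = non_conforming_cpus(read_governors())
    if offenders:
        print("CPUs not using the expected governor:", ", ".join(offenders))
    else:
        print("No deviating governors found (or cpufreq is not exposed).")
```

A check of this kind could run as a pre-flight step on each node, so that a run is aborted rather than producing measurements affected by frequency scaling.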

Standardized Test Configurations Benchmarking SHOULD adhere to pre-defined configurations to enable comparability across CNIs and platforms, aligning with . The following elements MUST be documented:

Kubernetes version and distribution.
CNI plugin version and configuration parameters.
Kernel version and system tunables (e.g., MTU size, sysctl options).
CPU model, memory size, and network interface type.
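
The documented elements listed above can be gathered into a single record that travels with the results. The sketch below is one possible shape for such a record, not a defined schema: all field names and all example values (versions, parameters, hardware descriptions) are hypothetical placeholders, and only the kernel release and CPU architecture are read from the local host.

```python
import json
import platform


def describe_environment(k8s, cni, tunables, hardware):
    """Assemble the documented elements into one JSON-serializable record."""
    return {
        "kubernetes": dict(k8s),
        "cni": dict(cni),
        "kernel": {"release": platform.release(), "tunables": dict(tunables)},
        "hardware": {"cpu_arch": platform.machine(), **hardware},
    }


# Hypothetical placeholder values, for illustration only.
record = describe_environment(
    k8s={"version": "v1.30.0", "distribution": "k3s"},
    cni={"name": "flannel", "version": "v0.25.0", "parameters": {"backend": "vxlan"}},
    tunables={"mtu": 1450, "net.core.rmem_max": 212992},
    hardware={"cpu_model": "example-cpu", "memory_gib": 8, "nic_type": "virtio"},
)

print(json.dumps(record, indent=2, sort_keys=True))
```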

Test Repeatability and Statistical Significance Each experiment SHOULD be repeated a minimum of five times. For latency and throughput metrics, results MUST be reported using:

Minimum, average (stating whether the mean or the median is reported), and maximum values.
At least the 90th and 95th percentile values.

Furthermore, adequate warm-up periods at the start of each test run and cool-down periods between test runs SHOULD be included to prevent thermal bias or residual resource contention. Where possible, automation frameworks (e.g., CODEF, Ansible) SHOULD be used to ensure that each experiment is launched from a clean state.
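
The reporting requirements of this section can be sketched as a small summary function over the per-run measurements. This is a minimal illustration under stated assumptions: the nearest-rank percentile definition is chosen here for simplicity (the methodology does not prescribe one), both mean and median are reported, and the function names are hypothetical.

```python
import statistics


def nearest_rank_percentile(ordered, pct):
    """Nearest-rank percentile of an ascending list; pct is an integer in 1..100."""
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct*n/100
    return ordered[rank - 1]


def summarize(samples, min_runs=5):
    """Summarize repeated measurements (e.g., latency in microseconds)."""
    if len(samples) < min_runs:
        raise ValueError(f"need at least {min_runs} repetitions, got {len(samples)}")
    ordered = sorted(samples)
    return {
        "runs": len(ordered),
        "min": ordered[0],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "p90": nearest_rank_percentile(ordered, 90),
        "p95": nearest_rank_percentile(ordered, 95),
    }


if __name__ == "__main__":
    # Illustrative values only, not measured data.
    print(summarize([10, 2, 7, 4, 1, 9, 3, 8, 6, 5]))
```

With only five repetitions, the 90th and 95th percentiles both coincide with the maximum under this definition, which is one reason to treat five runs as a lower bound rather than a target.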

Traffic Generators, Traffic Models and Load Profiles Traffic generators MUST support multiple transport protocols (e.g., TCP, UDP), varying packet sizes, and varying packet inter-arrival times. Benchmarking tools such as iperf3, netperf, and sockperf are RECOMMENDED. For realistic CNI evaluation:

The netperf TCP_RR, TCP_CRR, and UDP_RR tests SHOULD be used to measure latency, jitter, and throughput.
Multiple flows and concurrent connections SHOULD be tested to simulate microservice interactions.

Benchmarks SHOULD include traffic profiles reflecting real-world microservice communications, such as:

Short-lived TCP connections (request/response).
Persistent streaming (large payloads, high throughput).
Burst UDP traffic for latency and packet loss analysis.
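
The three traffic profiles above could be mapped to tool invocations roughly as follows. This sketch only builds argument lists for netperf and iperf3 and does not execute them; the mapping of profiles to tools, the profile names, and all numeric values (durations, message sizes, stream counts, bitrates) are illustrative assumptions.

```python
def short_lived_tcp(server, seconds=30, request_bytes=64, response_bytes=64):
    """netperf TCP_CRR: opens a new connection for every request/response pair."""
    return ["netperf", "-H", server, "-t", "TCP_CRR", "-l", str(seconds),
            "--", "-r", f"{request_bytes},{response_bytes}"]


def persistent_stream(server, seconds=60, parallel_streams=4):
    """iperf3 TCP bulk transfer over long-lived connections, with JSON output."""
    return ["iperf3", "-c", server, "-t", str(seconds),
            "-P", str(parallel_streams), "-J"]


def udp_burst(server, seconds=10, bitrate="1G", datagram_bytes=1200):
    """iperf3 UDP at a target bitrate; the report includes jitter and loss."""
    return ["iperf3", "-c", server, "-u", "-b", bitrate,
            "-l", str(datagram_bytes), "-t", str(seconds), "-J"]


PROFILES = {
    "short_lived_tcp": short_lived_tcp,
    "persistent_stream": persistent_stream,
    "udp_burst": udp_burst,
}

if __name__ == "__main__":
    # "10.0.0.2" is a placeholder for the server pod or service address.
    for name, build in PROFILES.items():
        print(name, "->", " ".join(build("10.0.0.2")))
```

Keeping the profiles as data makes it straightforward to record the exact command line alongside each result.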

Workload Simulation, Emulation, and Stress Testing To evaluate performance under real-world loads, benchmarking MUST include scenarios with:

Low, medium, and high pod churn rates (creation/deletion).
Concurrent service access and policy enforcement.
Synthetic network and node failures.

Tools such as kube-burner, chaos-mesh, and tc-netem are RECOMMENDED to orchestrate these scenarios, aligning with stress test guidance in .
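
As one example of a synthetic network impairment, the sketch below builds tc-netem commands that add delay, jitter, and packet loss to an interface and later remove them. It only constructs the argument lists; running them requires root privileges on the test node. The interface name and the impairment values are illustrative assumptions.

```python
def netem_add(dev, delay_ms=None, jitter_ms=None, loss_pct=None):
    """Build a 'tc qdisc add ... netem' command for the given impairments."""
    cmd = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if delay_ms is not None:
        cmd += ["delay", f"{delay_ms}ms"]
        if jitter_ms is not None:
            cmd.append(f"{jitter_ms}ms")  # jitter is only valid together with delay
    if loss_pct is not None:
        cmd += ["loss", f"{loss_pct}%"]
    if cmd[-1] == "netem":
        raise ValueError("at least one impairment must be specified")
    return cmd


def netem_clear(dev):
    """Build the command that removes the root qdisc and restores defaults."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]


if __name__ == "__main__":
    # "eth0" and the values below are placeholders.
    print(" ".join(netem_add("eth0", delay_ms=50, jitter_ms=10, loss_pct=1)))
    print(" ".join(netem_clear("eth0")))
```

Pairing every impairment with its matching clear command helps ensure that the next experiment starts from a clean state.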

Observability and Resource Instrumentation CNIs SHOULD expose internal metrics (e.g., policy hits, flow counts, packet drops). Benchmarks MUST capture:

CPU and memory usage per CNI pod or process (e.g., via Prometheus).
NIC statistics.
Network path visibility (e.g., using Cilium Hubble or Calico flow logs).

Experimental and open-source examples of how such metrics can be captured at the node and network level are available in the CODECO project and its respective code. Resource metrics MUST be collected at both node-level and pod-level granularity.
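
Per-pod CPU and memory usage for CNI pods could, for example, be retrieved from Prometheus with queries like those built below. The sketch assumes that Prometheus scrapes the cAdvisor container metrics (container_cpu_usage_seconds_total and container_memory_working_set_bytes) with namespace and pod labels; the Prometheus address, namespace, and pod name pattern are hypothetical placeholders.

```python
from urllib.parse import urlencode


def pod_selector(namespace, pod_regex):
    """PromQL label selector matching the CNI pods of interest."""
    return f'{{namespace="{namespace}",pod=~"{pod_regex}"}}'


def cpu_query(namespace, pod_regex, window="1m"):
    """Per-pod CPU usage in cores, averaged over the given window."""
    sel = pod_selector(namespace, pod_regex)
    return f"sum by (pod) (rate(container_cpu_usage_seconds_total{sel}[{window}]))"


def memory_query(namespace, pod_regex):
    """Per-pod working-set memory in bytes."""
    sel = pod_selector(namespace, pod_regex)
    return f"sum by (pod) (container_memory_working_set_bytes{sel})"


def instant_query_url(base_url, promql):
    """URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Placeholder address and pod pattern.
    base = "http://prometheus.example:9090"
    print(instant_query_url(base, cpu_query("kube-system", "cilium-.*")))
    print(instant_query_url(base, memory_query("kube-system", "cilium-.*")))
```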

Result Reporting and Output Format Benchmarking outputs SHOULD:

Use machine-readable formats (e.g., JSON or YAML, optionally described by a YANG data model).
Clearly label all test parameters and metrics.
Include system logs, configuration manifests, and tool versions.

A common results schema SHOULD be developed to support comparative analysis and long-term reproducibility, in line with goals in .
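
Until a common schema exists, a results record might look like the following sketch, which emits JSON and rejects metrics that lack an explicit unit. The field names, the schema version string, and the example values are hypothetical and are not a proposed standard.

```python
import json


def build_result_record(test_id, parameters, metrics, tool_versions):
    """Build a labeled, JSON-serializable result record."""
    for name, metric in metrics.items():
        if "value" not in metric or "unit" not in metric:
            raise ValueError(f"metric {name!r} needs both 'value' and 'unit'")
    return {
        "schema_version": "0.1",  # hypothetical version tag
        "test_id": test_id,
        "parameters": parameters,
        "metrics": metrics,
        "tool_versions": tool_versions,
    }


def to_json(record):
    """Serialize with sorted keys so that records diff cleanly across runs."""
    return json.dumps(record, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Illustrative values only, not measured data.
    example = build_result_record(
        test_id="tcp-rr-inter-node-001",
        parameters={"protocol": "TCP", "topology": "inter-node", "message_bytes": 64},
        metrics={"latency_p95": {"value": 120.0, "unit": "us"}},
        tool_versions={"netperf": "example-version"},
    )
    print(to_json(example))
```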

IANA Considerations This document has no IANA considerations.

Security Considerations Benchmarking tools and automation frameworks may introduce risk vectors such as elevated container privileges or misconfigured network policies. Experiments involving stress tests or fault injection should be performed in isolated environments. Benchmarking outputs SHOULD NOT expose sensitive cluster configuration or node-level details.

References

Normative References

Informative References

CODECO Experimental Framework. CODECO Consortium.
CODECO D12 - Basic Operation Components and Toolkit version 2.0. CODECO Consortium.
Energy-aware Differentiated Services (EA-DS). IETF draft draft-sofia-green-energy-aware-diffserv-00, active.
CODECO Deliverable D10: Technological Guidelines, Reference Architecture, and Open-source Ecosystem Design. CODECO Consortium.
Considerations for Benchmarking Network Performance in Containerized Infrastructures. IETF draft draft-ietf-bmwg-containerized-infra-07, active.
Antrea CNI. Antrea Project.
Project Calico. Tigera, Inc.
Cilium: eBPF-based Networking, Security, and Observability. Cilium Authors.
Kubernetes Documents-Cluster Networking. Kubernetes Authors.
Flannel CNI Plugin. flannel-io.
Kube-OVN: A Cloud-Native SDN for Kubernetes. Kube-OVN Project.
Kube-Router: All-in-One CNI, Service Proxy, and Network Policy. Kube-Router Community.
Weave Net: Fast, Simple Networking for Kubernetes. Weaveworks (archived).
Cilium Benchmarking Tools. Cilium Authors.
Benchmarking Kubernetes Container Network Interfaces: Methodology, Metrics, and Observations.
Amazon EKS Pod Networking with the AWS VPC CNI. Amazon Web Services.
Prometheus Monitoring System Overview. Prometheus Authors.
cAdvisor: Container Advisor. Google.
tc-netem: Network Emulation. Linux Foundation.
Kube-Burner: Kubernetes Performance and Scalability Tool. Cloud-Bulldozer Project.
iPerf3: Network Bandwidth Measurement Tool. ESnet / Lawrence Berkeley National Lab.
k6: Modern Load Testing Tool. Grafana Labs.
L2S-M: Lightweight Layer 2 Switching for Microservice Networks. Universidad Carlos 3 de Madrid.
Netperf: Network Performance Benchmark. Hewlett Packard Enterprise.
SockPerf: RDMA and TCP/UDP Latency Benchmark. NVIDIA Mellanox.
Kubernetes Bench-Suite.
CNCF CNF Test Suite.
Kepler: Kubernetes-based Power Estimation and Reporting. CNCF.
Energy-Aware Networked Systems for a Sustainable Future.
Multi-Edge Architecture for the Internet of Things (IoT). IETF draft draft-dwon-t2trg-multiedge-arch-02, expired.
Service Mesh-based Data Transfer Architecture. IETF draft draft-si-service-mesh-dta-01, expired.
Workload Identity Best Practices. IETF draft draft-ietf-wimse-workload-identity-practices-00, active.
Interconnection Intents for Network Services. IETF draft draft-contreras-nmrg-interconnection-intents-05, expired.

Acknowledgements This work has been funded by the European Commission in the context of the Horizon Europe CODECO project under grant number 101092696, and by SGC under grant agreement M-0626, project SemComIIoT. We thank Minh-Ngoc Tran for his contributions towards alignment with the draft , and for suggesting the removal of the former Section 4, which provided only a CNI summary.

Appendix A. Change Log Since draft-samizadeh-bmwg-cni-benchmarking-00:

Sections 4 and 5 were removed.
Added details about CNI Behavior in Federated and Multi-Cluster Environments.
Added details about Observability and Bottleneck Detection in multi-cluster or federated environments.
Revised references to Kubernetes network model and IETF drafts.
Minor editorial updates and formatting corrections.