<?xml version='1.0' encoding='utf-8'?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: https://xml2rfc.tools.ietf.org/. -->

<!-- https://docs.google.com/document/d/1KFzCj7FYKaSrtuWzx5GiYgMLB5IZs4tcuvDQtajMR4g/edit -->

<!DOCTYPE rfc SYSTEM "rfc2629-xhtml.ent">
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
    please see http://xml.resource.org/authoring/README.html. -->
<rfc
        xmlns:xi="http://www.w3.org/2001/XInclude"
        category="info"
        docName="draft-ihsan-nmrg-rl-vne-ps-01"
        ipr="trust200902"
        obsoletes=""
        updates=""
        submissionType="IETF"
        xml:lang="en"
        tocInclude="true"
        tocDepth="4"
        symRefs="true"
        sortRefs="true"
        version="3">
    <!-- xml2rfc v2v3 conversion 2.38.1 -->
    <!-- category values: std, bcp, info, exp, and historic
      ipr values: trust200902, noModificationTrust200902, noDerivativesTrust200902,
         or pre5378Trust200902
      you can add the attributes updates="NNNN" and obsoletes="NNNN"
      they will automatically be output with "(if approved)" -->

    <!-- ***** FRONT MATTER ***** -->

    <front>
        <!-- The abbreviated title is used in the page header - it is only necessary if the
            full title is longer than 39 characters -->
        <title abbrev="ML-based Virtual Network Embedding">
            Reinforcement Learning-Based Virtual Network Embedding: Problem Statement
        </title>
        <seriesInfo name="Internet-Draft" value="draft-ihsan-nmrg-rl-vne-ps-01"/>
        <!-- add 'role="editor"' below for the editors if appropriate -->

        <!-- Another author who claims to be an editor -->

        <author fullname="Ihsan Ullah" initials="I." surname="Ullah">
            <organization>KOREATECH</organization>
            <address>
                <postal>
                    <street>1600, Chungjeol-ro, Byeongcheon-myeon, Dongnam-gu</street>
                    <!-- Reorder these if your country does things differently -->
                    <city>Cheonan</city>
                    <region>Chungcheongnam-do</region>
                    <code>31253</code>
                    <country>Republic of Korea</country>
                </postal>
                <email>ihsan@koreatech.ac.kr</email>
                <!-- uri and facsimile elements may also be added -->
            </address>
        </author>
        <author fullname="Youn-Hee Han" initials="Y-H." surname="Han">
            <organization>KOREATECH</organization>
            <address>
                <postal>
                    <street>1600, Chungjeol-ro, Byeongcheon-myeon, Dongnam-gu</street>
                    <!-- Reorder these if your country does things differently -->
                    <city>Cheonan</city>
                    <region>Chungcheongnam-do</region>
                    <code>31253</code>
                    <country>Republic of Korea</country>
                </postal>
                <email>yhhan@koreatech.ac.kr</email>
                <!-- uri and facsimile elements may also be added -->
            </address>
        </author>
        <author fullname="TaeYeon Kim" initials="TY." surname="Kim">
            <organization>ETRI</organization>
            <address>
                <postal>
                    <street>218 Gajeong-ro, Yuseong-gu</street>
                    <!-- Reorder these if your country does things differently -->
                    <city>Daejeon</city>
                    <code>34129</code>
                    <country>Republic of Korea</country>
                </postal>
                <email>tykim@etri.re.kr</email>
                <!-- uri and facsimile elements may also be added -->
            </address>
        </author>


        <date year="2021"/>
        <!-- If the month and year are both specified and are the current ones, xml2rfc will fill
            in the current day for you. If only the current year is specified, xml2rfc will fill
         in the current day and month for you. If the year is not the current one, it is
         necessary to specify at least a month (xml2rfc assumes day="1" if not specified for the
         purpose of calculating the expiry date).  With drafts it is normally sufficient to
         specify just the year. -->

        <!-- Meta-data Declarations -->

        <area>General</area>
        <workgroup>Internet Engineering Task Force</workgroup>
        <!-- WG name at the upperleft corner of the doc,
            IETF is fine for individual submissions.
         If this element is not present, the default is "Network Working Group",
            which is used by the RFC Editor as a nod to the history of the IETF. -->

        <keyword>virtual network embedding</keyword>
        <keyword>machine learning</keyword>
        <!-- Keywords will be incorporated into HTML output
            files in a meta tag but they have no effect on text or nroff
            output. If you submit your draft to the RFC Editor, the
            keywords will be used for the search engine. -->

        <abstract>
            <t> In Network virtualization (NV) technology, Virtual Network Embedding (VNE) is an algorithm used to map a virtual network to the substrate network.
                VNE is the core orientation of NV, which has a great impact on the performance of virtual networks and the resource utilization of the substrate network.
                An efficient embedding algorithm can maximize the acceptance ratio of virtual networks to increase the revenue for Internet service providers.
                Several works have appeared on the design of VNE solutions; however, it has become a challenging issue for researchers.
                To solve the VNE problem, we believe that reinforcement learning (RL) can play a vital role in making the VNE algorithm more intelligent and efficient.
                Moreover, RL has been merged with deep learning techniques to develop adaptive models with effective strategies for various complex problems.
                In RL, agents can learn desired behaviors (e.g., optimal VNE strategies), and after learning and completing training, an agent can embed a virtual network onto the substrate network very quickly and efficiently.
                RL can reduce the complexity of the VNE algorithm; however, it is difficult to apply RL techniques directly to VNE problems, and more research is needed.
                In this document, we present a problem statement to motivate researchers toward solving the VNE problem using deep reinforcement learning.

            </t>
        </abstract>
    </front>
    <middle>
        <section numbered="true" toc="default">
            <name>Introduction and Scope</name>
            <t> Recently, Network virtualization (NV) technology has received a lot of attention from academics and industry.
           It allows multiple heterogeneous virtual networks to share resources on the same substrate network (SN) <xref target="RFC7364"/>, <xref target="ASNVT2020"/>.
           The current large-size fixed substrate network architecture is no longer efficient and not extendable due to network ossification.
           To overcome these limitations, traditional Internet Service Providers (ISPs) are divided into two independent parts which work together.
           One is the Service Providers (SPs), who create and own a number of VNs, and the other one is the Infrastructure Providers (InPs), who own the SN devices and links as underlying resources.
           SPs generate and construct the customized Virtual Network Requests (VNRs), and lease the resources from InPs based on those requests.
           In addition, two types of mediators can enter into the industry domain for better coordination of SPs and InPs.
           One is the Virtual Network Providers (VNPs) who assemble and coordinate diverse virtual resources from one or more InPs, the other one is the Virtual Network Operators (VNOs) who create, manage, and operate the VN according to the demand of the SPs.
           VNPs and VNOs could enable efficient use of the physical network and increase the commercial revenue of both SPs and InPs. NV can increase network agility, flexibility and scalability while creating significant cost savings.
           Greater network workload mobility, increased availability of network resources with good performance, and automated operations, are all the benefits of NV.
            </t>
            <t>
          Virtual Network Embedding (VNE) <xref target="VNESURV2013"/> is one of the main techniques and strategies used to map a virtual network to the substrate network.
          A VNE algorithm has two main parts, Node embedding: where virtual nodes of a VN have to be mapped to the SN nodes, and Link embedding: where virtual links between the virtual nodes have to be mapped to the physical paths in the substrate network.
          It has been proven to be NP-Hard, and both node and link embeddings have become challenging for the researchers.
          A virtual node and link should be efficiently embedded into a given SN, so that more VNR can be accepted with minimum cost.
          The distance of the virtual nodes from each other in a given SN contributes greatly to link failures and causes the rejection of VNRs.
          Hence, an efficient and intelligent technique is required for the VNE problem to reduce VNR rejections <xref target="ENViNE2021"/>.
          From the perspective of the InPs, an efficient VNE performs better mostly in terms of revenue, acceptance ratio, and revenue-to-cost ratio.

           </t>
          <t>Figure 1 shows an example of two virtual network requests, VNR1 and VNR2, to be embedded in the given substrate network.
          VNR1 contains three virtual nodes (a, b, and c) with CPU demands (15, 30, and 10) respectively, and the links between the virtual nodes a-b, b-c, and c-a with bandwidth demands 15, 20, and 35 respectively.
          Similarly, VNR2 contains virtual nodes and links with CPU and bandwidth demands respectively.
          The purpose of the VNE algorithm is to map the virtual nodes and links of the VNRs to the physical nodes and links of the given substrate network, as shown in Figure 1 <xref target="ENViNE2021"/>.
</t>
            <figure>
        <name>Substrate network with embedded virtual network, VNR1 and VNR2</name>
                <artwork align='center'>
           +----+                +----+         +----+          +----+
           | a  |                | d  |         | e  |          | f  |
           | 15 |                | 25 |__ _25___| 30 |__ _35_ __| 45 |
           +----+                +----+         +----+          +----+
          /      \                \                                 /
        15        35               30                              20
        /          \                \                             /
  +----+            +----+           +----+                 +----+
  | b  |            | c  |           | g  |                 | h  |
  | 30 |__ _20_ __ _| 10 |           | 15 |__ _ __10__ __ __| 35 |
  +----+            +----+           +----+                 +----+

           (VNR1)                                 (VNR2)
             ||   Embedding                         ||    Embedding
             VV                                     VV

        +----+              +----+       +----+                  +----+
 .......| a  |......35......| c  |       | d  |........25........| e  |
:  _____| 15 |              | 10 |_______| 25 |          ________| 30 |
: |     +----+              +----+       +----+         |        +----+
: |   A      |                | :   B      | :          |   C      |  :
: |   50     |__ ___50__ __ __| :   60     |_:_ __30 _ _|   40     |  :
: +__________+                +_:_________+  :          +__________+  :
:      |                        :     |      :                |       :
15     |                        :     |      :                |      35
:     40                       20     60     :               50       :
:      |                        :     |     30                |       :
:      |                       _:_____|_     :                |       :
+----:..............20........|.:       |    :                |   +----+
| b  | |   +----+.....30......|.........|....:                |   | f  |
| 30 |_|___| g  |             |       +----+                __|___| 45 |
+----+     | 15 |.....10......|.......| h  |........20.....|......+----+
 |   D     +____+             |    E  | 35 |               |     F    |
 |   50     |__ __ __ 70 _____|    40 +____+ ___ __ 50_ ___|     60   |
 +__________+                 +_________+                  +__________+

                  </artwork>
                </figure>

      <t>
          Recently, artificial intelligence and machine learning technologies have been widely used to solve networking problems <xref target='SUR2018'/>, <xref target='MLCNM2018'/>, <xref target='MVNNML2021'/>.
           There has been a surge in research efforts, especially in reinforcement learning (RL), which has contributed much to many complex tasks, e.g., video games and auto-driving.
           The main goal of RL is to learn better policies for sequential decision-making problems (e.g., VNE) and solve them very efficiently.
      </t>
      <t>
          Problems such as node ordering, pattern matching, and network feature extraction can all be simplified by graph-related theories and techniques.
          Graph neural network (GNN) is a new type of ML model architecture that can aggregate graph features (degrees, distance to specific nodes, node connectivity, etc.) on nodes <xref target='DVNEGCN2021'/>.
<!--          Several variants of GNN were proposed but the most known one is GCN (Graph Convolutional Network) which generalizes the convolutional operation from euclidean data (images and grid) to non-euclidean data (graph).-->
          The model can be used to cluster nodes and links according to the physical nodes and physical links attribute characteristics (CPU, storage, bandwidth, delay, etc.), and it is highly suitable for graph structures of any topological form. Hence, GNN is useful to find the best VNE strategy by intelligent agent training, and the organic combination of VNE and GCN has a good prerequisite.
      </t>
      <t>
          Designing and applying RL techniques directly into VNE problems is not yet trivial, but may face several challenges. This document describes the problems.
          Several works have appeared on the design of VNE solutions using RL, which focuses on how to interact with the environment to achieve maximum cumulative return <xref target="VNEQS2021"/>, <xref target="NRRL2020"/>, <xref target="MVNE2020"/>, <xref target="CDVNE2020"/>, <xref target="PPRL2020"/>, <xref target="RLVNEWSN2020"/>, <xref target="QLDC2019"/>, <xref target="VNFFG2020"/>, <xref target="VNEGCN2020"/>, <xref target="NFVDeep2019"/>, <xref target="DeepViNE2019"/>, <xref target="VNETD2019"/>, <xref target="RDAM2018"/>, <xref target="MOQL2018"/>, <xref target="ZTORCH2018"/>, <xref target="NeuroViNE2018"/>, <xref target="QVNE2020"/>.
          This document outlines the problems encountered when designing and applying RL-based VNE solutions.
          Section 2 describes how to design RL-based VNE solutions. Section 3 gives terminology, and Section 4 describes the problem space details.
      </t>
         </section>
            <section anchor="terminology" numbered="true" toc="default">
            <name>Reinforcement Learning-based VNE Solutions</name>

            <t> As we have discussed, RL has been studied in various fields (such as game, control system, operation research, information theory, multi-agent system, network system, etc.) and shows better performance than humans.
                Unlike deep learning, RL trains a policy model by receiving rewards through interaction with the environment without training label data.
                </t><t>
                Recently, there have been several attempts to solve VNE problems using RL.
                When applying RL-based algorithms to solve VNE problems, the RL agent automatically learns without human intervention through interaction with the environment.
                Once the agent has completed the learning process, it can generate the most appropriate embedding decision (action) based on the state of the network.
                Based on the embedding or action, the agent gets a reward from the environment to adaptively train its policy for future actions.
                The RL agent obtains the most optimized model based on the reward function defined according to each objective (revenue, cost, revenue-to-cost ratio, and acceptance ratio).
                The optimal RL policy model provides the VNE strategy appropriately according to the objective of the network operator.
                </t>
              <t>
                Figure 2 shows the virtual network embedding solution based on RL algorithm.
                The RL is divided into a training process and an inference process.
                In the training process, state information is composed of various substrate networks and VNRs (Environment), which are used as suitable inputs for RL models through feature extraction.
                After that, the RL model is updated by model updater using a feature extracted state and reward.
                In the inference process, using the trained RL model, the embedding result is provided to the operating network in real time.
                </t>
                <t>
                  The following figure shows the details of the RL-method-based virtual network embedding solution.
                </t>
                <figure>
                    <name>Two processes for RL method based VNE</name>
                    <artwork align='center'>
RL Model Training Process
+--------------------------------------------------------------------+
| Training Environment                                               |
| +-------------------+         RL-based VNE Agent                   |
| | +---------+       |         +----------------------------------+ |
| | | +---------+     |         |                   Action         | |
| | | | +----------+  |&lt;----------------------------------+     | |
| | + | | Substrate|  |         |                         |        | |
| |   | | Networks |  |         |  +----------+      +----------+  | |
| |   + +----------+  |  State  |  | Feature  |      |    RL    |  | |
| |                   |----------->|Extraction|----->|   Model  |  | |
| | +--------+        |         |  +----------+      | (Policy) |  | |
| | | +---------+     |         |       |            +----------+  | |
| | + | +---------+   |         |       |   +---------+     A      | |
| |   + |  VNRs   |   | Reward  |       +-->|  Model  |     |      | |
| |     +---------+   |-------------------->| Updater |-----+      | |
| +-------------------+         |           +---------+            | |
|                               +----------------------------------+ |
+--------------------------------------------------------------------+
                                  |
Inference Process                 |
+---------------------------------V----------------------------------+
|                         + - - - - - - - +                          |
| Operating Network       |   RL Model    |    Trained RL Model      |
| (Inference Environment) |   Training    |------------------+       |
| +-------------------+   |   Process     |                  |       |
| |   +-----------+   |   + - - - - - - - +                  |       |
| |   |           |   |         RL-based VNE Agent           |       |
| |   | Substrate |   |         +----------------------------|-----+ |
| |   |  Network  |   |         |                   Action   |     | |
| |   |           |   |&lt;--------------------------------+   |     | |
| |   +-----------+   |         |                        |   V     | |
| | +---------+       |         |  +------------+     +---------+  | |
| | | +---------+     | State   |  |  Feature   |     | Trained |  | |
| | + | +----------+  |----------->| Extraction |---->|   RL    |  | |
| |   + |   VNRs   |  |         |  +------------+     |  Model  |  | |
| |     +----------+  |         |                     +---------+  | |
| +-------------------+         +----------------------------------+ |
+--------------------------------------------------------------------+
                    </artwork>
                </figure>
        </section>

        <section numbered="true" toc="default">
            <name>Terminology</name>

            <dl newline="true" spacing="normal" indent="3">
                <dt>Network Virtualization</dt>
                <dd>Network virtualization is the process of combining hardware and software network resources and network functionality into a single, software-based administrative entity, a virtual network <xref target="RFC7364"/>.</dd>
            </dl>

            <dl newline="true" spacing="normal" indent="3">
                <dt>Virtual Network Embedding (VNE)</dt>
                <dd>Virtual Network Embedding (VNE) <xref target="VNESURV2013"/> is one of the main techniques used to map a virtual network to the substrate network.
                        </dd>
            </dl>
            <dl newline="true" spacing="normal" indent="3">
                <dt>Substrate Network (SN)</dt>
                <dd>The underlying physical network which contains the resources such as CPU and bandwidth for virtual networks is called substrate network.</dd>
            </dl>

            <dl newline="true" spacing="normal" indent="3">
                <dt>Virtual Network Request (VNR)</dt>
                <dd>Virtual Network Request is a complete single Virtual network request containing virtual nodes and virtual links.</dd>
            </dl>

            <dl newline="true" spacing="normal" indent="3">
                <dt>Agent</dt>
                <dd>In RL, an agent is the component that makes the decision and takes an action (i.e., embedding decision).</dd>
            </dl>
             <dl newline="true" spacing="normal" indent="3">
                <dt>State</dt>
                <dd>State is a representation (e.g., remaining SN capacity and requested VN resource) of the current environment, and it tells the agent what situation it is in currently.</dd>
            </dl>

            <dl newline="true" spacing="normal" indent="3">
                <dt>Action</dt>
                <dd>Actions (i.e., node and link embedding) are behavior an RL agent can do to change the states of the environment.</dd>
            </dl>
            <dl newline="true" spacing="normal" indent="3">
                <dt>Policy</dt>
                <dd>A policy defines an agent's way of behaving at a given time.
                        It is a mapping from perceived states of environment to actions to be taken when in those states.
                        It is usually implemented as a deep learning model because the state and action spaces are too large to be completely known.</dd>
            </dl>

            <dl newline="true" spacing="normal" indent="3">
                <dt>Reward</dt>
                <dd>A reward is the feedback which the environment provides to the agent for taking actions that lead to good outcomes (e.g., achieving the objective of the network operator).</dd>
            </dl>
            <dl newline="true" spacing="normal" indent="3">
                <dt>Environment</dt>
                <dd>An environment is the agent’s world in which it lives and interacts.
                    The agent can interact with the environment by performing some action but cannot influence the rules of the environment by those actions.</dd>
            </dl>

        </section>

        <section anchor="problem_statements" numbered="true" toc="default">
            <name>Problem Space</name>
            <t>RL contains three main components: state representation, action space, and reward description.
                For solving a VNE problem, we need to consider how to design the three main RL components.
                In addition, a specific RL algorithm, training environment, sim2real gap, and generalization are also important issues that should be considered and addressed.
                We will describe each one in detail as follows.</t>
                 <section numbered="true" toc="default">
                    <name>State Representation</name>
                    <t>The way to understand and observe the VNE problem is crucial for an RL agent to establish a thorough knowledge of the network status and generate efficient embedding decisions.
                        Therefore, it is essential to firstly design the state representation that serves as the input to the agent.
                        The state representation is the information which an agent can receive from the environment, and consists of a set of values representing the current situation in the environment.
                        Based on the state representation, the RL agent selects the most appropriate action through its policy model.
                        In the VNE problem, an RL agent needs to know the information of the overall SN entities and their current status in order to use the resources of the nodes and edges of the substrate network.
                        Also it must know the requirements of the VNR.
                        Therefore, in the VNE problem, the state usually should represent the current resource state of the nodes and edges of the substrate network (i.e., CPU, memory, storage, bandwidth, delay, loss rate, etc.) and the requirements of the virtual nodes and links of the VNR.
                        The collected status information is used as raw input, or refined status information through the feature extraction process is used as input for the RL agent.
                        The state representation may vary depending on the operator's objective and VNE strategy.
                        The method of determining such feature extraction and representation greatly affects the performance of the agent.
                    </t>
                 </section>
            <section numbered="true" toc="default">
                    <name>Action Space</name>
                    <t>In RL, an action represents a decision that an RL agent can take based on current state representation.
                        The set of all possible actions is called an action space.
                        In the VNE problems, actions are generally divided into node embedding and link embedding. The action for node embedding means the VNR’s nodes are assigned to which nodes in the SN.

                        Also, for link embedding, the action represents the selected paths between the selected substrate network nodes from the node embedding result.
                        If the policy model of the RL agent is well trained, it will select the embedding result to maximize the reward appropriate for the operator's objectives.
                        The output actions generated from the agent will indicate the adjustment of allocated resources.

                        It is noted that, at each point of time step, an RL algorithm may decide to 1) embed each virtual node onto substrate nodes and then embed each virtual link onto substrate paths separately, or 2) embed the given whole VNR onto substrate nodes and links in the SN at once.
                        In the former case, at every single step, a learning agent focuses on exactly one virtual node from the current VNR, and it generates a certain substrate node to host the virtual node.
                        Link embedding is then performed separately in the same time step.
                        To solve the VNE problem efficiently, the mappings of virtual nodes and links are considered together, although they are performed separately.
                        Link mapping is considered more complex than node mapping, because a virtual link can be mapped onto a physical path with multiple hops.
                        On the other hand, at every single step, a learning agent can try to embed the given whole VNR, i.e., all virtual nodes and links in the given VNR, onto a subset of SN components.
                        The whole VNR embedding should be handled as a graph embedding, so that the action space is huge and the design of the RL algorithm is usually more difficult than the one with each node and link embedding.

                    </t>
                 </section>
                   <section numbered="true" toc="default">
                    <name>Reward Description</name>
                    <t>Designing rewards is an important issue for an RL algorithm.
                       In general, the reward is the benefit that an RL agent follows when performing its determined action.
                       Reward is an immediate value that evaluates only the current state and action.
                       The value of reward depends on success or failure of each step.
                       In order to select the action that gives the best results in the long run, an RL agent needs to select the action with the highest cumulative reward.
                       The reward is calculated through the reward function according to the objective of the environment, and even in the same environment, it may be different depending on the operator’s objective.
                       Based on the given reward the agent can evaluate the effectiveness to improve the policy.
                       Hence, the reward function plays an important role in the training process of RL.
                       In the VNE problem, the overall objectives are to reduce VNR rejections, embed VNRs with minimum cost, maximize the revenue, and increase the utilization of physical resources.
                       Reward function should be designed to achieve one or multiple ones of these objectives.
                       Each objective and its correspondent reward design are outlined as follows:

                    </t>

                      <dl newline="true" spacing="normal" indent="3">
                      <dt>Revenue</dt>
                      <dd>Revenue is the sum of the virtual resources requested by the VN, and calculated to determine the total cost of the resources.
                        Typically, a successful action (e.g., VNR is embedded without violation) is treated to be a good reward which also increases the revenue.
                        Otherwise, a failed action (e.g., VNR is rejected) leads that the agent will receive a negative reward as well as decreasing the revenue.</dd>
                      </dl>
                       <dl newline="true" spacing="normal" indent="3">
                      <dt>Acceptance Ratio</dt>
                      <dd>Acceptance ratio is the ratio measured by the number of successfully embedded virtual network requests divided by total number of virtual network requests.
                          To achieve a high acceptance ratio, the agent is trying to embed maximum VNR and get a good reward. Getting a good reward is usually proportional to the acceptance ratio.</dd>
                      </dl>

                      <dl newline="true" spacing="normal" indent="3">
                      <dt>Revenue-to-cost ratio</dt>
                      <dd>To balance and compare the cost of resources for embedding VNR, the revenue is divided by cost.
                          Revenue-to-cost ratio compares the embedding algorithms with respect to their embedding results in terms of the cost and revenue.
                          Since most VNOs are primarily interested in this objective, a reward function should be designed to relate to this performance metric.</dd>
                      </dl>

                    </section>

         <section anchor="use_cases" numbered="true" toc="default">
            <name>Policy and RL Algorithms</name>
            <t>The policy is the strategy that the agent employs to determine the next action based on the current state.
               It maps states to actions that promise the highest reward.
               Therefore, an RL agent updates its policy repeatedly in the learning phase to maximize the expected cumulative reward.
               Unlike supervised learning, in which each sample has a corresponding label indicating the preferred output of the learning model, an RL agent relies on reward signals to evaluate the effectiveness of actions and further improve the policy.
               From the perspective of RL, the goal of VNE is to find an optimal policy to embed a VNR onto the given SN in any state at any time.
               There are two types of RL algorithms: on-policy and off-policy.
               In on-policy RL algorithms,  the (behaviour) policy of the exploration step to select an action and the policy to learn are the same.
               On-policy algorithms work with a single policy, and require any observations (state, action, reward, next state) to have been generated using that policy.
               Representative on-policy algorithms include A2C, A3C, TRPO, and PPO. On the other hand, off-policy RL algorithms work with two policies.
               These are a policy being learned, called the target policy, and the policy being followed that generates the observations, called the behaviour policy.
               In off-policy RL algorithms, the learning policy and the behaviour policy are not necessarily the same. It allows the use of exploratory policies for collecting the experience, since learning and behavior policies are separated.
               In the VNE problem, various experiences can be accumulated by extracting embedding results using various behavior policies. Representative off-policy algorithms include Q-learning, DQN, DDPG, and SAC.
               RL algorithms can also be classified into model-based and model-free algorithms. In model-based RL algorithms, an RL agent learns its optimal behavior indirectly by learning a model of the environment, taking actions and observing the outcomes, which include the next state and the immediate reward.
               The models predict the outcomes of actions. The model is used instead of the environment or in addition to interaction with it to learn optimal policies.
               This becomes, however, impractical when the state and action space is large. Unlike model-based algorithms, model-free RL algorithms learn directly by trial and error with the environment and do not require the relatively large memory.
               Since data efficiency or safety is very important even in VNE problems, the use of model-based algorithms can be actively considered. However, since it is not easy to build a good model that mimics a real network environment, a model-free RL algorithm may be more suitable for VNE problems.
               In conclusion, a good RL algorithm selection plays an important role in solving the VNE problem, and VNE performance metrics vary depending on the selected RL algorithm.

            </t>
              </section>
                <section  numbered="true" toc="default">
                    <name>Training Environment</name>
                    <t>Simulation is the use of software to simulate an interacting environment that is difficult to actually execute and test.
                        An RL algorithm learns by iteratively interacting with the environment. However, in the real environment, various variables such as failure and component consumption exist.
                        Therefore, it is necessary to learn through a simulation that simulates the real environment.
                        In order to solve the VNE problem, we need to use a network simulator similar to the real environment because it is difficult to repeatedly experiment with real network environments using an RL algorithm, and it is very challenging and overwhelming to directly apply an RL algorithm to real-world environments.
                        When solving VNE problems, a network simulation environment similar to a real network is required. The network simulation environment should have a general SN environment and VNR required by the operator.
                        The SN has nodes and links between nodes, and each has capacity such as CPU and Bandwidth.
                        In the case of VNR, there are virtual nodes and links required by the operator, and each must have its own requirements.
                    </t>
                </section>
                    <section  numbered="true" toc="default">
                    <name>Sim2Real Gap</name>
                    <t>An RL algorithm iteratively learns through a simulation environment to train a model of the desired policy.
                        The trained model is then applied to the real environment and/or tuned more for adapting to the real one.
                         However, when the model trained in the simulation is applied to the real environment, the sim2real gap problem arises. The simulation environment does not perfectly match the real environment, which often causes the tuning process to fail and leads to poor performance of the model.
                        The sim2real gap is caused by the difference between the simulation and the real environment.
                        It is because the simulation environment cannot perfectly simulate the real environment, and there are many variables in the real environment.
                        In a real network environment for VNE, the SN's nodes and links may fail due to external factors, or capacity such as CPU may change suddenly.
                        In order to solve this problem, the simulation environment should be more robust or the trained RL model should be generalized.
                        To reduce the gap between the simulation and real network environments, we need to train our model with an efficient and large number of VNRs and keep training the agent so that it does not depend only on previous memorization.
                    </t>
                  </section>
                <section  numbered="true" toc="default">
                        <name> Generalization</name>
                        <t>Generalization refers to the trained model's ability to adapt properly to previously unseen new observations.
                            An RL algorithm tries to learn a model that optimizes some objective with the purpose of performing well on data that has never been seen by the model during training.
                            In terms of VNE problems, the generalization is a measure of how the agent’s policy model performs on predicting unseen VNR.
                            The RL agent not only has to memorize all the previous variance of the VNR but also to learn and explore more possible variance.
                            It is important to have good and efficient training data for VNR with good variance and train the model with all possible VNRs.
                        </t>
                </section>
                </section>

        <section anchor="IANA" numbered="true" toc="default">
            <name>IANA Considerations</name>
            <t>This memo includes no request to IANA.</t>
            <t>All drafts are required to have an IANA considerations section (see
                <xref target="RFC5226" format="default">Guidelines for Writing an IANA Considerations Section in RFCs
                </xref>
                for a guide). If the draft does not require IANA to do
                anything, the section contains an explicit statement that this is the
                case (as above). If there are no requirements for IANA, the section will
                be removed during conversion into an RFC by the RFC Editor.
            </t>
        </section>
        <section anchor="Security" numbered="true" toc="default">
            <name>Security Considerations</name>
            <t>All drafts are required to have a security considerations section.
                See <xref target="RFC3552" format="default">RFC 3552</xref> for a guide.

            </t>
        </section>

    </middle>
    <!--  *****BACK MATTER ***** -->

    <back>
        <!-- References split into informative and normative -->

        <!-- There are 2 ways to insert reference entries from the citation libraries:
         1. define an ENTITY at the top, and use "ampersand character"RFC2629; here (as shown)
         2. simply use a PI "less than character"?rfc include="reference.RFC.2119.xml"?> here
            (for I-Ds: include="reference.I-D.narten-iana-considerations-rfc2434bis.xml")

         Both are cited textually in the same manner: by using xref elements.
         If you use the PI option, xml2rfc will, by default, try to find included files in the same
         directory as the including file. You can also define the XML_LIBRARY environment variable
         with a value containing a set of directories to search.  These can be either in the local
         filing system or remote ones accessed by http (http://domain/dir/... ).-->

        <references>
            <name>Informative References</name>
            <!-- Here we use entities that we defined at the beginning. -->
               <reference anchor="ASNVT2020" target="https://doi.org/10.1145/3379444">
                <front>
                    <title>A Survey of Network Virtualization Techniques for Internet of Things using SDN and NFV</title>
                    <seriesInfo name="DOI" value="10.1145/3379444"/>
                    <author initials="Kashif" surname="Sharif" fullname="Kashif Sharif">
                        <organization/>
                        </author>
                    <author initials="Fan" surname="Li" fullname="Fan Li">
                        <organization/>
                        </author>
                    <author initials="Zohaib" surname="Latif" fullname="Zohaib Latif">
                        <organization/>
                        </author>
                     <author initials="MM" surname="Karim" fullname="M. M. Karim">
                        <organization/>
                        </author>
                    <author initials="Sujit" surname="Biswas" fullname="Sujit Biswas">
                        <organization/>
                        </author>
                    <date year="2020" month="April"/>
                 </front>
            </reference>

             <reference anchor="VNESURV2013" target="https://doi.org/10.1109/SURV.2013.013013.00155">
                <front>
                    <title>Virtual Network Embedding: A Survey</title>
                    <seriesInfo name="DOI" value="10.1109/SURV.2013.013013.00155"/>
                    <author initials="A." surname="Fischer" fullname="Andreas Fischer">
                        <organization/>
                        </author>
                    <author initials="Juan Felipe" surname="Botero" fullname="Juan Felipe Botero">
                        <organization/>
                        </author>
                    <author initials="M." surname="Till Beck" fullname="Michael Till Beck">
                        <organization/>
                        </author>
                    <author initials="H." surname="De Meer" fullname="Hermann de Meer">
                        <organization/>
                        </author>
                    <author initials="X." surname="Hesselbach" fullname="Xavier Hesselbach">
                        <organization/>
                        </author>
                    <date year="2013"/>
                 </front>
            </reference>

            <reference anchor="ENViNE2021" target="https://ieeexplore.ieee.org/document/9415185">
                <front>
                    <title>Ego Network-Based Virtual Network Embedding Scheme for Revenue Maximization</title>
                    <seriesInfo name="DOI" value="10.1109/ICAIIC51459.2021.9415185"/>
                    <author initials="IHSAN" surname="ULLAH" fullname="Ihsan Ullah">
                        <organization/>
                    </author>
                    <author initials="Hyun-Kyo" surname="Lim" fullname="Hyun-Kyo Lim">
                        <organization/>
                    </author>
                    <author initials="Youn-Hee" surname="Han" fullname="Youn-Hee Han">
                        <organization/>
                    </author>
                   <date year="2021" month="April"/>
                    </front>
            </reference>

            <reference anchor="SUR2018" target="https://link.springer.com/article/10.1186/s13174-018-0087-2">
                <front>
                    <title>A Comprehensive survey on Machine Learning for Networking: Evolution, Applications and Research Opportunities</title>
                    <seriesInfo name="DOI" value="10.1186/s13174-018-0087-2"/>
                    <author initials="Raouf" surname="Boutaba" fullname="Raouf Boutaba">
                        <organization/>
                    </author>
                    <author initials="Mohammad" surname="Salahuddin" fullname="Mohammad A. Salahuddin">
                        <organization/>
                    </author>
                    <author initials="Noura" surname="Limam" fullname=" Noura Limam">
                        <organization/>
                    </author>
                    <author initials="Sara" surname="Ayoubi" fullname=" Sara Ayoubi">
                        <organization/>
                    </author>
                    <author initials="Nashid" surname="Shahriar" fullname="Nashid Shahriar">
                        <organization/>
                    </author>
                    <author initials="Felipe" surname="Estrada-Solano" fullname="Felipe Estrada-Solano">
                        <organization/>
                    </author>
                    <author initials="Oscar" surname="M. Caicedo " fullname="Oscar M. Caicedo ">
                        <organization/>
                    </author>
                   <date year="2018" month="June"/>
                    </front>
            </reference>

             <reference anchor="DVNEGCN2021" target="https://ieeexplore.ieee.org/document/9475485">
                <front>
                    <title>Dynamic Virtual Network Embedding Algorithm based on Graph Convolution Neural Network and Reinforcement Learning</title>
                    <seriesInfo name="DOI" value="10.1109/JIOT.2021.3095094"/>
                    <author initials="Peiying" surname="Zhang" fullname="Peiying Zhang">
                        <organization/>
                    </author>
                    <author initials="Chao" surname="Wang" fullname="Chao Wang">
                        <organization/>
                    </author>
                    <author initials="NeeraJ" surname="Kumar" fullname="Neeraj Kumar">
                        <organization/>
                    </author>
                    <author initials="Weishan" surname="Zhang" fullname="Weishan Zhang">
                        <organization/>
                    </author>
                    <author initials="Lei" surname="Liu" fullname="Lei Liu">
                        <organization/>
                    </author>

                   <date year="2021" month="July"/>
                    </front>
            </reference>

               <reference anchor="MLCNM2018" target="https://ieeexplore.ieee.org/document/8255757">
                <front>
                    <title>Machine Learning for Cognitive Network Management</title>
                    <seriesInfo name="DOI" value="10.1109/MCOM.2018.1700560"/>
                    <author initials="Sara" surname="Ayoubi" fullname="Sara Ayoubi">
                        <organization/>
                    </author>
                    <author initials="Limam" surname="Noura" fullname="Noura Limam">
                        <organization/>
                    </author>
                    <author initials="Mohammad" surname="Salahuddin" fullname="Mohammad A. Salahuddin">
                        <organization/>
                    </author>
                    <author initials="Nashid" surname="Shahriar" fullname="Nashid Shahriar">
                        <organization/>
                    </author>
                    <author initials="NRaouf" surname="Boutaba" fullname="Raouf Boutaba">
                        <organization/>
                    </author>
                    <author initials="Felipe" surname="Estrada-Solano" fullname="Felipe Estrada-Solano">
                        <organization/>
                    </author>
                    <author initials="Oscar" surname="M. Caicedo " fullname="Oscar M. Caicedo ">
                        <organization/>
                    </author>
                   <date year="2018" month="Jan"/>
                    </front>
            </reference>

        <reference anchor="MVNNML2021" target="https://www.semanticscholar.org/paper/Managing-Virtualized-Networks-and-Services-with-Boutaba-Shahriar/48b8fc73c1609d4632d7db5e67e373a62a3cc1f6">
                <front>
                    <title>Managing Virtualized Networks and Services with Machine Learning</title>
                    <seriesInfo name="DOI" value="48b8fc73c1609d4632d7db5e67e373a62a3cc1f6"/>
                    <author initials="Raouf" surname="Boutaba" fullname="Raouf Boutaba">
                        <organization/>
                    </author>
                    <author initials="Nashid" surname="Shahriar" fullname="Nashid Shahriar">
                        <organization/>
                    </author>
                    <author initials="Mohammad" surname="A" fullname="Salahuddin">
                        <organization/>
                    </author>
                    <author initials="Noura" surname="Limam" fullname="Limam">
                        <organization/>
                    </author>
                    <date year="2021" month="Jan"/>
                    </front>
            </reference>

            <reference anchor="VNEQS2021" target="https://link.springer.com/article/10.1007/s00607-020-00883-w">
                <front>
                    <title>VNE Solution for Network Differentiated QoS and Security Requirements: From the Perspective of Deep Reinforcement Learning</title>
                    <seriesInfo name="DOI" value="10.1007/s00607-020-00883-w"/>
                    <author initials="Chao" surname="Wang" fullname="Chao Wang">
                        <organization/>
                    </author>
                    <author initials=" Ranbir Singh" surname="Batth" fullname=" Ranbir Singh Batth">
                        <organization/>
                    </author>
                    <author initials="Peiying" surname="Zhang" fullname="Peiying Zhang">
                        <organization/>
                    </author>
                    <author initials="Gagangeet" surname="Aujla" fullname="Gagangeet Singh Aujla">
                        <organization/>
                    </author>
                     <author initials="Youxiang" surname="Duan " fullname="Youxiang Duan ">
                        <organization/>
                    </author>
                     <author initials="Lihua" surname="Ren" fullname="Lihua Ren ">
                        <organization/>
                    </author>
                    <date year="2021" month="Jan"/>
                    </front>
            </reference>


            <reference anchor="NRRL2020" target="https://ieeexplore.ieee.org/document/9109671">
                <front>
                    <title>Network Resource Allocation Strategy Based on Deep Reinforcement Learning</title>
                    <seriesInfo name="DOI" value="10.1109/OJCS.2020.3000330"/>
                    <author initials="" surname="" fullname="Shidong Zhang; Chao Wang; Junsan Zhang; Youxiang Duan; Xinhong You;  Peiying Zhang">
                        <organization/>
                    </author>

                    <date year="2020" month="June"/>
                    </front>
            </reference>

             <reference anchor="MVNE2020" target="https://doi.org/10.1002/cpe.6020">
                <front>
                    <title>Modeling on Virtual Network Embedding using Reinforcement Learning</title>
                    <seriesInfo name="DOI" value="10.1002/cpe.6020"/>
                    <author initials="" surname="" fullname="Cong Wang, Fanghui Zheng, Guangcong Zheng, Sancheng Peng, Zejie Tian, Yujia Guo, Guorui Li, Ying Yuan">
                        <organization/>
                    </author>

                    <date year="2020" month="Sep"/>
                    </front>
            </reference>

            <reference anchor="CDVNE2020" target="https://ieeexplore.ieee.org/document/8982091">
                <front>
                    <title>A Continuous-Decision Virtual Network Embedding Scheme Relying on Reinforcement Learning</title>
                    <seriesInfo name="DOI" value="10.1109/TNSM.2020.2971543"/>
                    <author initials="" surname="" fullname="Haipeng Yao; Sihan Ma; Jingjing Wang; Peiying Zhang; Chunxiao Jiang; Song Guo">
                        <organization/>
                    </author>

                    <date year="2020" month="Feb"/>
                    </front>
            </reference>

            <reference anchor="PPRL2020" target="https://ieeexplore.ieee.org/document/8982091">
                <front>
                    <title>A Privacy-Preserving Reinforcement Learning Algorithm for Multi-Domain Virtual Network Embedding</title>
                    <seriesInfo name="DOI" value="10.1109/TNSM.2020.2971543"/>
                    <author initials="" surname="" fullname="Davide Andreoletti; Tanya Velichkova; Giacomo Verticale; Massimo Tornatore; Silvia Giordano">
                        <organization/>
                    </author>

                    <date year="2020" month="Sep"/>
                    </front>
            </reference>

         <reference anchor="RLVNEWSN2020" target="https://ieeexplore.ieee.org/document/9253442">
                <front>
                    <title>Reinforcement Learning for Virtual Network Embedding in Wireless Sensor Networks</title>
                    <seriesInfo name="DOI" value="10.1109/WiMob50308.2020.9253442"/>
                    <author initials="" surname="" fullname="Haitham Afifi; Holger Karl">
                        <organization/>
                    </author>

                    <date year="2020" month="Oct"/>
                    </front>
            </reference>

            <reference anchor="QLDC2019" target="https://link.springer.com/article/10.1007/s00521-019-04376-6">
                <front>
                    <title>A Q-Learning-Based Approach for Virtual Network Embedding in Data Center</title>
                    <seriesInfo name="DOI" value="10.1007/s00521-019-04376-6"/>
                    <author initials="" surname="" fullname="Ying Yuan, Zejie Tian, Cong Wang, Fanghui Zheng, Yanxia Lv ">
                        <organization/>
                    </author>
                    <date year="2019" month="July"/>
                    </front>
            </reference>

           <reference anchor="VNFFG2020" target="https://www.rfc-editor.org/info/rfc2629">
                <front>
                    <title>Evolutionary Actor-Multi-Critic Model for VNF-FG Embedding</title>
                    <seriesInfo name="DOI" value="10.1109/CCNC46108.2020.9045434"/>
                    <author initials="P.T" surname="Anh Quang" fullname="Pham Tran Anh Quang">
                        <organization/>
                    </author>
                    <author initials="Y." surname="Hadjadj-Aoul" fullname="Yassine Hadjadj-Aoul">
                        <organization/>
                    </author>
                    <author initials="A." surname="Outtagarts" fullname="Abdelkader Outtagarts">
                        <organization/>
                    </author>
                    <date year="2020" month="Jan"/>
                  </front>
            </reference>

            <reference anchor="VNEGCN2020" target="https://ieeexplore.ieee.org/document/9060910">
                <front>
                    <title>Automatic Virtual Network Embedding: A Deep Reinforcement Learning Approach With Graph Convolutional Networks</title>
                    <seriesInfo name="DOI" value="10.1109/JSAC.2020.2986662"/>
                    <author initials="Z." surname="Yan" fullname="Zhongxia Yan">
                        <organization/>
                    </author>
                    <author initials="J." surname="Ge" fullname="Jingguo Ge">
                        <organization/>
                    </author>
                    <author initials="Y." surname="Wu" fullname="Yulei Wu">
                        <organization/>
                    </author>
                    <author initials="L." surname="Li" fullname="L. Li">
                        <organization/>
                    </author>
                    <author initials="T." surname="Li" fullname="T. Li">
                        <organization/>
                    </author>
                    <date year="2020" month="April"/>
                                    </front>
            </reference>

             <reference anchor="NFVDeep2019" target="https://doi.org/10.1145/3326285.3329056">
                <front>
                    <title>NFVdeep: Adaptive Online Service Function Chain Deployment with Deep Reinforcement Learning</title>
                    <seriesInfo name="DOI" value="10.1145/3326285.3329056"/>
                    <author initials="Y." surname="Xiao" fullname="Y. Xiao">
                        <organization/>
                    </author>
                    <author initials="Q." surname="Zhang" fullname="Q. Zhang">
                        <organization/>
                    </author>
                    <author initials="F." surname="Liu" fullname="F. Liu">
                        <organization/>
                    </author>
                    <author initials="J." surname="Wang" fullname="J. Wang">
                        <organization/>
                    </author>
                    <author initials="M." surname="Zhao" fullname="M. Zhao">
                        <organization/>
                    </author>

                    <author initials="Z." surname="Zhang" fullname="Z. Zhang">
                        <organization/>
                    </author>

                    <author initials="J." surname="Zhang" fullname="J. Zhang">
                        <organization/>
                    </author>
                    <date year="2019" month="June"/>
                    </front>
            </reference>

            <reference anchor="DeepViNE2019" target="https://ieeexplore.ieee.org/document/8845171">
                <front>
                    <title>DeepViNE: Virtual Network Embedding with Deep Reinforcement Learning</title>
                    <seriesInfo name="DOI" value="10.1109/INFCOMW.2019.8845171"/>
                    <author initials="M." surname="Dolati" fullname="M. Dolati">
                        <organization/>
                    </author>
                    <author initials="S. B." surname="Hassanpour" fullname="S. B. Hassanpour">
                        <organization/>
                    </author>
                    <author initials="M." surname="Ghaderi" fullname="M. Ghaderi">
                        <organization/>
                    </author>
                    <author initials="A." surname="Khonsari" fullname="A. Khonsari">
                        <organization/>
                    </author>
                    <date year="2019" month="Sep"/>
                   </front>
            </reference>

            <reference anchor="VNETD2019" target="https://doi.org/10.1016/j.comnet.2019.05.004">
                <front>
                    <title>VNE-TD: A Virtual Network Embedding Algorithm Based on Temporal-Difference Learning</title>
                    <seriesInfo name="DOI" value="10.1016/j.comnet.2019.05.004"/>
                    <author initials="S." surname="Wang" fullname="S. Wang">
                        <organization/>
                    </author>
                    <author initials="J." surname="Bi" fullname="J. Bi">
                        <organization/>
                    </author>
                    <author initials="A." surname="V.Vasilakos" fullname="A. V. Vasilakos">
                        <organization/>
                    </author>
                    <author initials="Q." surname="Fan" fullname="Q. Fan">
                        <organization/>
                    </author>
                    <date year="2019" month="Oct"/>
                    </front>
            </reference>

             <reference anchor="RDAM2018" target="https://ieeexplore.ieee.org/document/8469054">
                <front>
                    <title>RDAM: A Reinforcement Learning Based Dynamic Attribute Matrix Representation for Virtual Network Embedding</title>
                    <seriesInfo name="DOI" value="10.1109/TETC.2018.2871549"/>
                    <author initials="" surname="" fullname="Haipeng Yao; Bo Zhang; Peiying Zhang; Sheng Wu; Chunxiao Jiang; Song Guo">
                        <organization/>
                    </author>

                    <date year="2018" month="Sep"/>
                    </front>
            </reference>

            <reference anchor="MOQL2018" target="https://jwcn-eurasipjournals.springeropen.com/articles/10.1186/s13638-018-1170-x">
                <front>
                    <title>Multi-Objective Virtual Network Embedding Algorithm Based on Q-learning and Curiosity-Driven</title>
                    <seriesInfo name="DOI" value="10.1109/TETC.2018.2871549"/>
                    <author initials="" surname="" fullname="Mengyang He, Lei Zhuang, Shuaikui Tian, Guoqing Wang, Kunli Zhang ">
                        <organization/>
                    </author>
                    <date year="2018" month="June"/>
                    </front>
            </reference>

            <reference anchor="ZTORCH2018" target="https://ieeexplore.ieee.org/document/8450000">
                <front>
                    <title>Z-TORCH: An Automated NFV Orchestration and Monitoring Solution</title>
                    <seriesInfo name="DOI" value="10.1109/TNSM.2018.2867827"/>
                    <seriesInfo name="RFC" value="3552"/>
                    <seriesInfo name="BCP" value="72"/>
                    <author initials="V." surname="Sciancalepore" fullname="V. Sciancalepore">
                        <organization/>
                    </author>
                    <author initials="X." surname="Chen" fullname="X. Chen">
                        <organization/>
                    </author>
                    <author initials="F. Z." surname="Yousaf" fullname="F. Z. Yousaf">
                        <organization/>
                    </author>
                    <author initials="X." surname="Costa-Perez" fullname="X. Costa-Perez">
                        <organization/>
                    </author>
                    <date year="2018" month="August"/>
                    </front>
            </reference>

            <reference anchor="NeuroViNE2018" target="https://ieeexplore.ieee.org/document/8486263">
                <front>
                    <title>NeuroViNE: A Neural Preprocessor for Your Virtual Network Embedding Algorithm</title>
                    <seriesInfo name="DOI" value=" 10.1109/INFOCOM.2018.8486263"/>
                    <author initials="" surname="" fullname="Andreas Blenk; Patrick Kalmbach; Johannes Zerwas; Michael Jarschel; Stefan Schmid; Wolfgang Kellerer">
                        <organization/>
                    </author>
                    <date year="2018" month="June"/>
                    </front>
            </reference>

             <reference anchor="QVNE2020" target="https://link.springer.com/article/10.1007/s00521-019-04376-6">
                <front>
                    <title>A Q-learning-Based Approach for Virtual Network Embedding in Data Center</title>
                    <seriesInfo name="DOI" value="10.1007/s00521-019-04376-6"/>
                    <author initials="Y." surname="Yuan" fullname="Y. Yuan">
                        <organization/>
                    </author>
                    <author initials="Z." surname="Tian" fullname="Z. Tian">
                        <organization/>
                    </author>
                    <author initials="C." surname="Wang" fullname="C. Wang">
                        <organization/>
                    </author>
                    <author initials="F." surname="Zheng" fullname="F. Zheng">
                        <organization/>
                    </author>
                    <author initials="Y." surname="Lv" fullname="Y. Lv">
                     <organization/>
                    </author>
                    <date year="2020" month="July"/>
                   </front>
            </reference>

             <reference anchor="RFC7364" target="https://https://datatracker.ietf.org/doc/rfc7364/">
                <front>
                    <title>Problem Statement: Overlays for Network Virtualization</title>
                        <author initials="P.T" surname="Thomas" fullname="Thomas Narten">
                        <organization/>
                    </author>
                    <author initials="Y." surname="Eric" fullname="Eric Gray">
                        <organization/>
                    </author>
                    <author initials="A." surname="David" fullname="David Black">
                        <organization/>
                    </author>
                     <author initials="A." surname="Luyuan" fullname="Luyuan Fang">
                        <organization/>
                    </author>
                    <author initials="A." surname="Larry" fullname="Larry Kreeger">
                        <organization/>
                    </author>
                    <author initials="A." surname="Maria Napierala" fullname="Maria Napierala">
                        <organization/>
                    </author>
                    <date year="2015" month="October"/>
                   </front>
            </reference>
            <reference anchor="RFC3552" target="https://www.rfc-editor.org/info/rfc3552">
                <front>
                    <title>Guidelines for Writing RFC Text on Security Considerations</title>
                    <seriesInfo name="DOI" value="10.17487/RFC3552"/>
                    <seriesInfo name="RFC" value="3552"/>
                    <seriesInfo name="BCP" value="72"/>
                    <author initials="E." surname="Rescorla" fullname="E. Rescorla">
                        <organization/>
                    </author>
                    <author initials="B." surname="Korver" fullname="B. Korver">
                        <organization/>
                    </author>
                    <date year="2003" month="July"/>
                   </front>
            </reference>

            <reference anchor="RFC5226" target="https://www.rfc-editor.org/info/rfc5226">
                <front>
                    <title>Guidelines for Writing an IANA Considerations Section in RFCs</title>
                    <seriesInfo name="DOI" value="10.17487/RFC5226"/>
                    <seriesInfo name="RFC" value="5226"/>
                    <author initials="T." surname="Narten" fullname="T. Narten">
                        <organization/>
                    </author>
                    <author initials="H." surname="Alvestrand" fullname="H. Alvestrand">
                        <organization/>
                    </author>
                    <date year="2008" month="May"/>
                    <abstract>
                        <t>Many protocols make use of identifiers consisting of constants and other well-known
                            values. Even after a protocol has been defined and deployment has begun, new values may
                            need to be assigned (e.g., for a new option type in DHCP, or a new encryption or
                            authentication transform for IPsec). To ensure that such quantities have consistent
                            values and interpretations across all implementations, their assignment must be
                            administered by a central authority. For IETF protocols, that role is provided by the
                            Internet Assigned Numbers Authority (IANA).
                        </t>
                        <t>In order for IANA to manage a given namespace prudently, it needs guidelines describing
                            the conditions under which new values can be assigned or when modifications to existing
                            values can be made. If IANA is expected to play a role in the management of a namespace,
                            IANA must be given clear and concise instructions describing that role. This document
                            discusses issues that should be considered in formulating a policy for assigning values
                            to a namespace and provides guidelines for authors on the specific text that must be
                            included in documents that place demands on IANA.
                        </t>
                        <t>This document obsoletes RFC 2434. This document specifies an Internet Best Current
                            Practices for the Internet Community, and requests discussion and suggestions for
                            improvements.
                        </t>
                    </abstract>
                </front>
            </reference>
            <!-- A reference written by by an organization not a person. -->
        </references>
<!--        <section anchor="app-additionala" numbered="true" toc="default">-->
<!--            <name>Acknowledgments</name>-->
<!--            <t>This becomes an Appendix.</t>-->
<!--        </section>-->
<!--        <section anchor="app-additionalb" numbered="true" toc="default">-->
<!--            <name>Contributors</name>-->
<!--            <t>This becomes an Appendix.</t>-->
<!--        </section>-->
        <!-- Change Log

    v00 2006-03-15  EBD   Initial version

    v01 2006-04-03  EBD   Moved PI location back to position 1 -
                         v3.1 of XMLmind is better with them at this location.
    v02 2007-03-07  AH    removed extraneous nested_list attribute,
                         other minor corrections
    v03 2007-03-09  EBD   Added comments on null IANA sections and fixed heading capitalization.
                         Modified comments around figure to reflect non-implementation of
                         figure indent control.  Put in reference using anchor="DOMINATION".
                         Fixed up the date specification comments to reflect current truth.
    v04 2007-03-09 AH     Major changes: shortened discussion of PIs,
                         added discussion of rfc include.
    v05 2007-03-10 EBD    Added preamble to C program example to tell about ABNF and alternative
                         images. Removed meta-characters from comments (causes problems).

    v06 2010-04-01 TT     Changed ipr attribute values to latest ones. Changed date to
                         year only, to be consistent with the comments. Updated the
                         IANA guidelines reference from the I-D to the finished RFC.
    v07 2020-01-21 HL    Converted the template to use XML schema version 3.
        -->
    </back>
</rfc>
