**Sponsor: National Science Foundation (CNS-0830961)
PI: Eytan Modiano; Dates: 9/2008 – 8/2012**

## Project overview

Networks often rely on the electronic layers to provide most protection and restoration services. However, in a layered network, the protection mechanisms provided at the electronic layer may not be robust in the face of failures in the underlying optical layer. For example, SONET networks typically provide protection against single link failures using a ring network architecture, and protection in general “mesh” networks (e.g., ATM, WDM) is typically provided using disjoint paths. However, even electronic topologies that are designed to be tolerant of single link failures, once they are embedded on the physical (e.g., fiber) topology, may no longer be survivable to single physical (fiber) link failures. This is because the failure of a single fiber link can lead to the failure of multiple links in the electronic topology, which may subsequently leave the electronic topology disconnected. Thus, network survivability mechanisms often cannot provide their claimed level of protection and restoration, when embedded on a physical topology.

## Summary of results

**I. Metrics for layered network survivability**

In [1] we show that standard survivability metrics, such as graph connectivity, have different meaning in the context of layered network architectures. Thus, we developed new metrics that capture the interactions across layers. These metrics will allow us to compare the survivability performance of different network architectures, and provide insights into the design of cross-layer network architectures that are highly survivable.

In particualr, we introduce the Min Cross Layer Cut as a new metric for measuring the survivability of layered networks. Min Cross Layer Cut is a natural extension of the single layer min-cut metric and corresponds to the minimum number of physical fiber cuts that would disconnect the logical topology of the layered network. We show that in the layered setting, fundamental theorems such as the max-flow-min-cut theorem no longer hold. We then develop new mechanisms for routing the logical links of the network on the physical toplogy so as to maximize the Min Cross Layer Cut. This allows the resulting cross-layer architecture to be resilient to failures across layers.

**II. Reliability in layered networks with probabilistic failures**

In [2] we develop diverse routing schemes for dealing with multiple, possibly correlated, failures. While disjoint path protection can effectively deal with isolated single link failures, recovering from multiple failures is not guaranteed. In particular, events such as natural disasters or intentional attacks can lead to multiple correlated failures, for which recovery mechanisms are not well understood. We take a probabilistic view of network failures where multiple failure events can occur simultaneously, and develop algorithms for finding diverse routes with minimum joint failure probability. Moreover, we develop a novel Probabilistic Shared Risk Link Group (PSRLG) framework for modeling correlated failures. In this context, we formulate the problem of finding two paths with minimum joint failure probability as an Integer Non-Linear Program (INLP), and develop approximations and linear relaxations that can find nearly optimal solutions in most cases.

In [3] we consider network reliability in layered networks where the lower layer experiences random link failures. In layered networks, each failure at the lower layer may lead to multiple failures at the upper layer. We generalize the classical polynomial expression for network reliability to the multi-layer setting. Using random sampling techniques, we develop polynomial time approximation algorithms for the failure polynomial. Our approach gives an approximate expression for reliability as a function of the link failure probability, eliminating the need to resample for different values of the failure probability. Furthermore, it gives insight on how the routings of the logical topology on the physical topology impact network reliability. We show that maximizing the min cut of the (layered) network maximizes reliability in the low failure probability regime. Based on this observation, we develop algorithms for routing the logical topology to maximize reliability.

In [4] we consider the impact of random failures that are geographically correlated on network connectivity. Fiber-optic networks are vulnerable to natural disasters, such as tornadoes or earthquakes, as well as to physical failures, such as an anchor cutting underwater fiber cables. Such real-world events occur in specific geographical locations and disrupt specific parts of the network. Therefore, the geography of the network determines the effect of physical events on the network’s connectivity and capacity. In [4] we developed tools to analyze network reliability after a ‘random’ geographic disaster. In particular, we consider disasters that take the form of a ‘random’ line in a plane. Using results from geometric probability, we are able to calculate some network performance metrics to such a disaster in polynomial time. In particular, we can evaluate average two-terminal reliability in polynomial time under ‘random’ line-cuts. This is in contrast to the case of independent link failures for which there exists no known polynomial time algorithm to calculate this reliability metric. Our novel approach provides a promising new direction for modeling and designing networks to lessen the effects of geographical disasters or attacks.

**III. Disjoint path routing in layered networks**

In [5] we consider the problem of survivability in multilayer networks. Our objective is to find a set of paths for a given source-destination pair that will survive any single link failure. In single layered networks, a pair of disjoint paths can be used to provide protection. However, disjoint paths may not exist in layered networks. In [5] we study the problem of finding paths that may not be disjoint but together will survive any single failure. First, we address the problem of finding the minimum number of survivable paths. Then, we focus on two versions of this problem: one where the length of a path is restricted, and another where the number of paths sharing a fiber is restricted. We prove that in general, finding the minimum survivable paths set is NP-hard. However, both restricted versions of the problem can be solved in polynomial time. We present several Integer Linear Programming (ILP) formulations, and develop heuristics and approximation algorithms. We also consider the problem of finding a set of survivable paths that uses the minimum number of fibers. We show that this problem is NP-hard in general, and develop heuristics and approximation algorithms with provable approximation bounds.

In [6] we developed a scheme in which a dedicated backup network is designed to provide protection from random link failures. Upon a link failure in the primary network, traffic is rerouted through a preplanned path in the backup network. We introduce a novel approach for dealing with random link failures, in which probabilistic survivability guarantees are provided to limit capacity over-provisioning. We show that the optimal backup routing strategy in this respect depends on the reliability of the primary network. Specifically, as primary links become less likely to fail, the optimal backup networks employ more resource sharing amongst backup paths. We apply results from the field of robust optimization to formulate an ILP for the design and capacity provisioning of these backup networks. We also propose a simulated annealing heuristic to solve this problem for large-scale networks. We use extensive simulation results to verify our analysis and approach, and demonstrate significant capacity savings through the use of the robust optimization approach. In [7] we apply a similar approach to networks where the traffic demand is stochastic, and develop capacity allocation scheme, using robust optimization, that is robust to random fluctuation in the end-user demand.

**IV. Enhancing reliability through lightpath routing**

In [8] we study the reliability maximization problem in WDM networks with random link failures. Reliability in these networks is defined as the probability that the logical network is connected, and it is determined by the underlying lightpath routing and the link failure probability. We show that in general the optimal lightpath routing depends on the link failure probability, and characterize the properties of lightpath routings that maximize the reliability in different failure probability regimes. In particular, we show that in the low failure probability regime, maximizing the “cross-layer” min cut of the (layered) network maximizes reliability, whereas in the high failure probability regime, minimizing the spanning tree of the network maximizes reliability. Motivated by these results, we develop lightpath routing algorithms for reliability maximization.

**V. Providing protection using recovery domains**

In [10] we consider the problem of providing network protection that guarantees the maximum amount of time that flow can be interrupted after a failure. This is in contrast to schemes that offer no recovery time guarantees, such as IP rerouting, or the prevalent local recovery scheme of Fast ReRoute, which often over-provisions resources to meet recovery time constraints.

To meet these recovery time guarantees, we provide a novel and flexible solution by partitioning the network into failure independent “recovery domains”, where within each domain, the maximum amount of time to recover from a failure is guaranteed. We develop an optimal solution in the form of an MILP for both the case when backup capacity can and cannot be shared. This approach provides protection with guaranteed recovery times using up to 45% less protection resources than local recovery. In a related work [9] we also develop schemes that provide availability guarantees to different sessions by proper choice of the recovery path, and the spare capacity allocation.

**VI. Participants**

Prof. Eytan Modiano, PI

Dr. Hyang-Won Lee, postdoc

Dr. Kayi Lee, graduate student, now at Google Research

Dr. Sebastian Neumayer, graduate student, now at MIT Lincoln Laboratory

Marzieh Parandehgheibi, graduate student

Matt Johnston, graduate student

Greg Kuperman, graduate student

**Publications**

[1] Kayi Lee and Eytan Modiano, “Cross-Layer Survivability in WDM-based Networks,” IEEE Infocom, Rio De Janeiro, April 2009. Extended version in IEEE/ACM Transactions on Networking, 2011.

[2] Hyang-Won Lee and Eytan Modiano, “Diverse Routing in Networks with Probabilistic Failures,” IEEE Infocom, Rio De Janeiro, April 2009. Extended version in IEEE/ACM Transaction on Networking, December, 2010.

[3] Kayi Lee, Hyang-Won Lee and Eytan Modiano, “Reliability in Layered Networks with Random Link Failures,”, IEEE Infocom, San Diego, March 2010. Extended version in IEEE/ACM Transactions on Networking, 2011.

[4] Sebastian Neumayer and Eytan Modiano, “Network Reliability with Geographically Correlated Failures,” IEEE INFOCOM 2010, San Diego, CA. Extended version in IEEE Transactions on Networking, 2012.

[5] Marzieh Parandehgheibi, Hyang-Won Lee and Eytan Modiano,”Survivable Paths in Multilayer Networks,” in Conference on Information Science and Systems, March, 2012.

[6] Matthew Johnston, Hyang-Won Lee, Eytan Modiano, “A Robust Optimization Approach to Backup Network Design with Random Failures,” IEEE Infocom, Shanghai, China, April 2011.

[7] M. Johnston, H.W. Lee, E. Modiano, “Robust Network Design for Stochastic Traffic Demands,” IEEE Globecom, Next Generation Networking Symposium, Houston, TX, December 2011.

[8] Kayi Lee, Hyang-Won Lee and Eytan Modiano, “Maximizing Reliability in WDM Networks through Lightpath Routing,” IEEE Globecom, December, 2011.

[9] Greg Kuperman, Eytan Modiano, Aradhana Narula-Tam, “Network Protection with Multiple Availability Guarantees,” IEEE ICC workshop on New Trends in Optical Networks Survivability, June 2012.

[10] Greg Kuperman, Eytan Modiano “Network Protection with Guaranteed Recovery Times using Recovery Domains,” Submitted to IEEE Infocom 2013.