In large-scale distributed systems, the interactions of many independent components may lead to emergent global behaviors with unforeseen, often detrimental, outcomes. The increasing importance of distributed systems such as clouds and computing grids will require analytical tools to understand and predict, complex system behavior to ensure system reliability. In previous work, we described a piece-wise homogeneous Discrete Markov chain representation of a computing grid can be systematically perturbed to predict situations that lead to performance degradations. While the execution time of this approach compared favorably with detailed large-scale simulation, a sizable number of perturbations of the model were needed to identify scenarios in which system performance degraded. Here, we evolve our original approach and describe two novel methods for quickly identifying portions of the Markov chain that are sensitive to perturbation. The first method involves finding minimal s-t cut sets, consisting of state transitions that disconnect all paths in a Markov chain from the initial to a desired end state. By perturbing state transitions in the cut set, it is possible to quickly identify scenarios in which system performance is adversely affected. This method can also be applied to larger Markov models than in our earlier work. We then present a second method, in which the Spectral Expansion Theorem is used to analyze the eigensystem of a set of Markov transition probability matrices to predict which state transitions, if perturbed, are likely to adversely affect system performance. Results are presented for both methods to show that they can be used to identify the same failure scenarios presented in our earlier paper (as well as additional scenarios, using the first method), while reducing the number of perturbations needed. We argue that these methods provide a basis for creating practical tools for analysis of complex systems.
Citation: NIST Interagency/Internal Report (NISTIR) - 7744
NIST Pub Series: NIST Interagency/Internal Report (NISTIR)
Pub Type: NIST PubsReport Number:
complex systems, perturbation analysis, discrete time piece-wise homogenous Markov chain, graph theory, minimal s-t cut set, Spectral Expansion Theorem, eigenvector, eigenvalue, grid computing.