In recent years, substantial research has been devoted to monitoring and predicting performance degradation in real-world complex systems within large entities such as nuclear power plants, electrical grids, and distributed computing systems. Special challenges arise because such systems operate in uncertain environments, are highly dynamic, and can exhibit emergent behaviors that can lead to catastrophic failure. Discrete-time Markov chains (DTMCs) have been an important focus of this research because they represent dynamic behavior succinctly, provide a means to measure uncertainty, and can be extended to be time-inhomogeneous in order to model long-term system evolution. Moreover, DTMCs provide a means to measure potential changes to system performance. To date, DTMCs have been proposed for tasks such as fault detection and long-term equipment condition monitoring in real-world complex systems. However, the scope of these models has generally been restricted to states that directly concern fault conditions. Less work has been done on using DTMCs to represent a more complete range of the states a complex system may enter during normal operation. Such comprehensive, detailed models allow a system to be analyzed in the context of normal operation, so that its evolution into undesirable states can be understood more precisely. This paper describes progress on an approach that uses larger, more detailed DTMC models to find potential failure scenarios in operational complex systems. The approach combines methods to perturb a DTMC, simulate alternative system evolutions, and identify scenarios in which a system may descend into failure. Key to the approach is the use of graph theory techniques to reduce the size of the search space of potential alternative behaviors to be explored. An example is provided in which a DTMC of significant size is used to predict failure in a distributed resource allocation system.
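The perturb-and-simulate idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the three-state chain, its probabilities, and the helper names (`perturb_row`, `evolve`, `failure_sweep`) are hypothetical, the chain is kept time-homogeneous for brevity, and the graph-theoretic reduction of the search space is not shown. The sketch moves probability mass within one row of a transition matrix and compares the resulting probability of reaching a failed state.

```python
# Hypothetical 3-state DTMC: 0 = normal, 1 = degraded, 2 = failed (absorbing).
# All probabilities are illustrative assumptions, not values from the paper.
P = [
    [0.95, 0.05, 0.00],
    [0.30, 0.65, 0.05],
    [0.00, 0.00, 1.00],
]
START = [1.0, 0.0, 0.0]  # system starts in the normal state
FAILED = 2


def perturb_row(P, row, src_col, dst_col, delta):
    """Return a copy of P with `delta` moved from P[row][src_col] to P[row][dst_col].

    The row sum is preserved, so the result is still a stochastic matrix.
    """
    Q = [r[:] for r in P]
    delta = min(delta, Q[row][src_col])  # keep every entry non-negative
    Q[row][src_col] -= delta
    Q[row][dst_col] += delta
    return Q


def evolve(P, dist, steps):
    """Propagate a state distribution through `steps` transitions of P."""
    n = len(P)
    for _ in range(steps):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist


def failure_sweep(P, start, steps, deltas):
    """For each delta, shift mass from degraded->normal to degraded->failed
    and record the probability of being in the failed state after `steps`."""
    results = []
    for d in deltas:
        Q = perturb_row(P, row=1, src_col=0, dst_col=2, delta=d)
        results.append((d, evolve(Q, start, steps)[FAILED]))
    return results


if __name__ == "__main__":
    for d, p_fail in failure_sweep(P, START, 50, [0.0, 0.05, 0.10, 0.20]):
        print(f"delta={d:.2f}  P(failed after 50 steps)={p_fail:.4f}")
```

Sweeping `delta` shows how sensitive the failure probability is to a single transition. In a model of realistic size, many such perturbations are possible, which is why a way of narrowing the set of candidates matters.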
Proceedings Title: Proceedings of the 2011 American Society of Mechanical Engineers (ASME) Pressure Vessels & Piping Division (PVPD) Conference
Conference Dates: July 7, 2011
Conference Location: Baltimore, MD
Conference Title: American Society of Mechanical Engineers (ASME) 2011 Pressure Vessels & Piping Division (PVPD) Conference
Pub Type: Conferences
Keywords: complex system, discrete-time Markov chain, time-inhomogeneous Markov chain, matrix perturbation