Over the past five years, we investigated methods to characterize global behavior in large distributed systems and applied those methods to predict the effects of deploying alternate distributed control algorithms. These methods assess global behavior under a wide range of conditions, yield significant understanding of overall system dynamics, and support insightful comparisons of competing control regimes. However, they do not reveal the potential for rare combinations of events to drive system dynamics into global failure regimes, leading to catastrophic collapse. Our ongoing research addresses this gap through two complementary thrusts: (1) design-time methods that enable system architects to identify and evaluate global failure scenarios that could lead to system collapse, and (2) run-time methods that alert system operators to an incipient transition to a global failure regime and subsequent collapse. Effective design-time methods will enable architects to devise mechanisms that prevent high-risk scenarios. Since no design-time method can identify all possible failure scenarios, effective run-time methods will signal operators when the system trajectory trends toward collapse, allowing remedial actions to forestall or mitigate catastrophic failure. In this short contribution, we reprise our previous work on methods to characterize global system dynamics and compare alternate control regimes, and then describe our ongoing work toward design-time and run-time methods for predicting global failure regimes.
Pub Type: Talks
Keywords: complex systems, failure prediction, global behavior