This project aims to develop and evaluate a coherent set of methods to understand behavior in complex information systems, such as the Internet, computational grids and computing clouds. Such large distributed systems exhibit global behavior arising from independent decisions made by many simultaneous actors, which adapt their behavior based on local measurements of system state. Actor adaptations shift the global system state, influencing subsequent measurements, leading to further adaptations. This continuous cycle of measurement and adaptation drives a time-varying global behavior. For this reason, proposed changes in actor decision algorithms must be examined at large spatiotemporal scale in order to predict system behavior. This presents a challenging problem.
What are complex systems? Large collections of interconnected components whose interactions lead to macroscopic behaviors in:
What is the problem? No one understands how to measure, predict or control macroscopic behavior in complex information systems, a gap that (1) threatens our nation's security and (2) costs billions of dollars.
"[Despite] society's profound dependence on networks, fundamental knowledge about them is primitive. [G]lobal communication ... networks have quite advanced technological implementations but their behavior under stress still cannot be predicted reliably.... There is no science today that offers the fundamental knowledge necessary to design large complex networks [so] that their behaviors can be predicted prior to building them."
The quote above is from Network Science 2006, a National Research Council report.
What is the new idea? Leverage models and mathematics from the physical sciences to define a systematic method to measure, understand, predict and control macroscopic behavior in the Internet and distributed software systems built on the Internet.
What are the technical objectives? Establish models and analysis methods that (1) are computationally tractable, (2) reveal macroscopic behavior and (3) establish causality. Characterize distributed control techniques, including: (1) economic mechanisms to elicit desired behaviors and (2) biological mechanisms to organize components.
Why is this hard? Valid computationally tractable models that exhibit macroscopic behavior and reveal causality are difficult to devise. Phase-transitions are difficult to predict and control.
Who would care? All designers and users of networks and distributed systems with a 25-year history of unexpected failures:
Businesses and customers who rely on today's information systems:
Designers and users of tomorrow's information systems that will adopt dynamic adaptation as a design principle:
Hard Issues & Plausible Approaches
Hard Issues | Plausible Approaches |
---|---|
H1. Model scale | A1. Scale-reduction techniques |
H2. Model validation | A2. Sensitivity analysis & key comparisons |
H3. Tractable analysis | A3. Cluster analysis and statistical analyses |
H4. Causal analysis | A4. Evaluate analysis techniques |
Model scale – Systems of interest (e.g., the Internet and compute grids) span large spatiotemporal extents, have global reach, consist of millions of components, and interact through many adaptive mechanisms over various timescales. Scale-reduction techniques must be employed. Which computational models can achieve sufficient spatiotemporal scaling properties? Micro-scale models are not computable at large spatiotemporal scale. Macro-scale models are computable and might exhibit global behavior, but can they reveal causality? Meso-scale models might exhibit global behavior and reveal causality, but are they computable? One plausible approach is to investigate abstract models from the physical sciences, e.g., fluid flows (from hydrodynamics), lattice automata (from gas chemistry), Boolean networks (from biology) and agent automata (from geography). We can apply parallel computing to scale to millions of components and days of simulated time. Scale reduction may also be achieved by adopting n-level experiments coupled with orthogonal fractional factorial (OFF) experiment designs.
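To illustrate how an orthogonal fractional factorial design reduces the number of simulation runs, the following is a minimal sketch. It assumes two-level factors coded as -1/+1 and a textbook set of generators; the function names and the choice of seven factors are illustrative and are not taken from the project's own experiment designs.

```python
from itertools import product


def full_factorial(k):
    """All 2**k runs of a two-level design, coded as -1/+1."""
    return [list(run) for run in product((-1, 1), repeat=k)]


def fractional_factorial(base_factors, generators):
    """Build a two-level fractional factorial design.

    base_factors: number of factors varied in a full factorial.
    generators:   for each additional factor, the tuple of base-column
                  indices whose product defines that factor's level.
    """
    design = []
    for run in full_factorial(base_factors):
        extra = []
        for cols in generators:
            level = 1
            for c in cols:
                level *= run[c]
            extra.append(level)
        design.append(run + extra)
    return design


def is_orthogonal(design):
    """True when every column is balanced and every column pair has zero dot product."""
    n_cols = len(design[0])
    columns = [[run[j] for run in design] for j in range(n_cols)]
    balanced = all(sum(col) == 0 for col in columns)
    uncorrelated = all(
        sum(a * b for a, b in zip(columns[i], columns[j])) == 0
        for i in range(n_cols)
        for j in range(i + 1, n_cols)
    )
    return balanced and uncorrelated


# Seven factors in 8 runs instead of the 2**7 = 128 runs of a full factorial.
# Base factors A, B, C; generated factors D=AB, E=AC, F=BC, G=ABC (illustrative choice).
design = fractional_factorial(3, [(0, 1), (0, 2), (1, 2), (0, 1, 2)])
print(len(design), "runs for", len(design[0]), "factors; orthogonal:", is_orthogonal(design))
```

The price of the reduction is aliasing: in a design this small, main effects are confounded with two-factor interactions, so a larger fraction is needed when interactions are expected to matter.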
Model validation – Scalable models from the physical sciences (e.g., differential equations, cellular automata, nk-Boolean nets) tend to be highly abstract. Can sufficient fidelity be obtained to convince domain experts of the value of insights gained from such abstract models? We can conduct sensitivity analyses to ensure the model exhibits relationships that match known relationships from other accepted models and empirical measurements. Sensitivity analysis also enables us to understand relationships between model parameters and responses. We can also conduct key comparisons along three complementary paths: (1) comparing model data against existing traffic and analysis, (2) comparing results from subsets of macro/meso-scale models against micro-scale models and (3) comparing simulations of distributed control regimes against results from implementations in test facilities, such as the Global Environment for Network Innovations.
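One simple form of sensitivity analysis is a main-effects calculation over a two-level design. The sketch below assumes a full factorial over three factors and uses a made-up linear function, `toy_model`, as a stand-in for a simulation run; the factor names and coefficients are hypothetical.

```python
from itertools import product


def main_effects(design, responses):
    """Main effect of each factor: mean response at +1 minus mean response at -1."""
    effects = []
    for j in range(len(design[0])):
        high = [y for run, y in zip(design, responses) if run[j] == 1]
        low = [y for run, y in zip(design, responses) if run[j] == -1]
        effects.append(sum(high) / len(high) - sum(low) / len(low))
    return effects


def toy_model(load, buffer_size, link_speed):
    """Hypothetical stand-in for one simulation run; returns a made-up delay response."""
    return 10.0 + 4.0 * load - 1.0 * buffer_size - 3.0 * link_speed


# Two-level full factorial over the three (hypothetical) factors, coded -1/+1.
design = [list(run) for run in product((-1, 1), repeat=3)]
responses = [toy_model(*run) for run in design]
effects = main_effects(design, responses)

for name, effect in zip(("load", "buffer_size", "link_speed"), effects):
    print(f"{name}: {effect:+.1f}")
```

A model passes this kind of check when the sign and relative size of each effect agree with relationships already known from accepted models or measurements.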
Tractable analysis – The scale of potential measurement data is expected to be very large – O(10**15) – with millions of elements, tens of variables, and millions of seconds of simulated time. How can measurement data be analyzed tractably? We could use homogeneous models, which allow one (or a few) elements to be sampled as representative of all. This reduces data volume to 10**6 – 10**7, which is amenable to statistical analyses (e.g., power-spectral density, wavelets, entropy, Kolmogorov complexity) and to visualization. Where homogeneous models are inappropriate, we can use clustering analysis to view relationships among groups of responses. We can also exploit correlation analysis and principal components analysis to identify and exclude redundant responses from collected data. Finally, we can construct combinations of statistical tests and multidimensional data visualization techniques tailored to specific experiments and data of interest.
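As a sketch of how correlation analysis can identify and exclude redundant responses, the following keeps a response only when it is not strongly correlated with a response already kept. The response names, the sample values and the 0.95 threshold are all assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series.

    Assumes neither series is constant (a constant series has zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_redundant(responses, threshold=0.95):
    """Keep a response only if |r| with every already-kept response is below threshold."""
    kept = []
    for name, series in responses.items():
        if all(abs(pearson(series, responses[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical measurements of three responses over six simulation runs.
responses = {
    "throughput": [10.0, 12.0, 15.0, 11.0, 18.0, 14.0],
    "goodput": [9.0, 10.8, 13.5, 9.9, 16.2, 12.6],  # 0.9 * throughput, so redundant
    "queue_delay": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}
kept = drop_redundant(responses)
print(kept)
```

The greedy pass depends on the order in which responses are listed; principal components analysis avoids that dependence at the cost of less directly interpretable dimensions.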
Causal analysis – Tractable analysis strategies yield coarse data with limited granularity of timescales, variables and spatial extents. Coarseness may reveal macroscopic behavior that is not explainable from the data. For example, an unexpected collapse in the probability density function of job completion times in a computing grid was unexplainable without more detailed data and analysis. Multidimensional analysis can represent system state as a multidimensional space and depict system dynamics through various projections (e.g., slicing, aggregation, scaling). State-space dynamics can segment system dynamics into an attractor-basin field and then monitor trajectories. Markov models providing compact, computationally efficient representations of system behavior can be subjected to perturbation analyses to identify potential failure modes and their causes.
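The idea of perturbing a Markov model to look for failure modes can be sketched as follows. The three-state chain, its transition probabilities and the rule of taking perturbed probability mass from the self-loop are assumptions for illustration only; they are not the project's grid model.

```python
def stationary(P, iterations=2000):
    """Stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def perturb(P, src, dst, delta):
    """Add probability `delta` to the src->dst transition, taken from src's self-loop."""
    Q = [row[:] for row in P]
    Q[src][dst] += delta
    Q[src][src] -= delta
    if min(Q[src]) < 0:
        raise ValueError("perturbation makes a probability negative")
    return Q


# Hypothetical job model: 0 = queued, 1 = running, 2 = failed and retrying.
P = [
    [0.50, 0.50, 0.00],
    [0.20, 0.75, 0.05],
    [0.30, 0.00, 0.70],
]
FAILED = 2
baseline = stationary(P)[FAILED]

# Perturb each outgoing transition of the "running" state and record how the
# long-run fraction of time spent in the failed state responds.
sensitivity = {}
for dst in (0, 2):
    Q = perturb(P, 1, dst, 0.05)
    sensitivity[(1, dst)] = stationary(Q)[FAILED] - baseline

print(f"baseline failed fraction: {baseline:.4f}")
for (src, dst), change in sensitivity.items():
    print(f"transition {src}->{dst}: change {change:+.4f}")
```

Transitions whose perturbation produces the largest shift toward the failed state are candidates for closer study as potential causes of degradation.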
Controlling Behavior – Large distributed systems and networks cannot be subjected to centralized control regimes because such systems involve too many elements, too many parameters, too much change, and too many policies. Can models and analysis methods be used to determine how well decentralized control regimes stimulate desirable system-wide behaviors? Two plausible approaches are: (1) use price feedback (e.g., auctions, present-value analysis or commodity markets) to modulate supply and demand for resources or services, and (2) use biological processes to differentiate function based on environmental feedback, e.g., morphogen gradients, chemotaxis, local and lateral inhibition, polarity inversion, quorum sensing, energy exchange and reinforcement.
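Price feedback can be illustrated with a very small decentralized loop in which a resource price rises when demand exceeds capacity and falls otherwise. The demand rule (each user spends a fixed budget), the gain and all numbers below are assumptions; real auction or commodity-market mechanisms are considerably richer.

```python
def demand(price, budgets):
    """Total demand when each hypothetical user requests budget / price units."""
    return sum(b / price for b in budgets)


def adjust_price(price, budgets, capacity, gain=0.1):
    """One feedback step: raise the price when demand exceeds capacity, lower it otherwise."""
    excess = (demand(price, budgets) - capacity) / capacity
    return price * (1.0 + gain * excess)


budgets = [4.0, 6.0, 10.0]  # hypothetical per-user budgets
capacity = 40.0             # hypothetical units of resource available
price = 1.0                 # arbitrary starting price

for _ in range(200):
    price = adjust_price(price, budgets, capacity)

print(f"price {price:.4f}, demand {demand(price, budgets):.2f}, capacity {capacity:.2f}")
```

With this demand rule the price settles where total budget divided by price equals capacity, so no central allocator is needed; each user reacts only to the posted price.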
Apr 2018 The project defined five potential run-time predictors of network congestion collapse, and evaluated the predictors under two traffic scenarios (increasing and steady loads) for three network models with varying degrees of realism. The project demonstrated that the simplest predictor provided the best accuracy and also provided significant warning time. The project also showed that two more complicated predictors (autocorrelation and variance) were unreliable, giving many false alerts under steady load.
Dec 2017 The project developed a layered and aggregated queuing network simulation model that can represent behaviors associated with individual packets while achieving increased computational efficiency over discrete-event simulation and also bounding error to a known level. The project demonstrated the application of the layered and aggregated queuing network simulation model to represent behaviors associated with distributed denial of service attacks, showing how such behaviors could be detected with known monitoring techniques, as had been demonstrated previously for discrete-event simulations.
Jul 2017 The project defined new sampling methods to improve the scalability of computation for systems with high dimensional uncertainties. The project demonstrated the application of the sampling methods to determine optimal control decisions and adaptive controls using reinforcement learning. For the demonstration problems, the new sampling methods achieved high accuracy while requiring limited computational resources.
Apr 2016 The project demonstrated that the degree of realism in network simulations influences evolution of network-wide congestion, and also identified key realistic factors that must be included in network simulations in order to draw valid conclusions about spreading network congestion, breakdown in network connectivity, probability of packet delivery, and latencies for successfully delivered packets.
Mar 2015 The project demonstrated that results from a previous study of virtual-machine placement algorithms in computational clouds would not be changed by the injection of asymmetries, dynamics, and failures. This demonstration increased confidence in findings from the previous study.
Dec 2014 The project delivered an effective and scalable method for uncertainty estimation in large-scale simulation models. The method, described in a paper in the proceedings of the 2014 Winter Simulation Conference, can be applied to provide accurate estimation of the value of model responses. The estimation algorithm requires a minimum of computation.
Sep 2014 The project delivered an experiment design and analysis method to determine effective settings for control parameters in evolutionary computation algorithms. The method was documented in a journal article accepted for publication by Evolutionary Computation, MIT Press, which is the leading journal in the field.
Aug 2014 The project delivered a proposal and oral presentation outlining research into methods to provide early warning of network catastrophes. The proposal and oral presentation were part of the FY 2015 NIST competition seeking innovations in measurement science.
Oct 2013 The project delivered an evaluation of a method combining genetic algorithms and simulation to search for failure scenarios in system models. The method was applied to a case study of the Koala cloud computing model. The method was able to discover a known failure cause, but in a novel setting, and was also able to discover several unknown failure scenarios. Subsequently, the method and evaluation were presented at an international workshop on simulation methods, and in two invited lectures, one at Mitre and one at George Mason University.
Dec 2012 In the fall of 2012, Dr. Mills contributed methods from this project to a DoE Office of Science Workshop on Computational Modeling of Big Networks (COMBINE). Dr. Mills also coauthored the report, which was published in December of 2012. The main NIST contributions are documented in Chapter 5 of the report, which outlines effective methods and best practices for experiment design and validation & verification of simulation models.
Nov 2011 In the fall of 2009, this project started investigating large scale behavior in Infrastructure Clouds. The project produced three related papers during 2011, and all three papers were accepted at the two major IEEE cloud computing conferences held during the year. The rapid success of the project in this new domain illustrates the general applicability of the methods we developed, as well as the ease with which those methods can be applied.
Nov 2010 Developed and demonstrated Koala, a discrete-event simulator for Infrastructure Clouds. Completed a sensitivity analysis of Koala to identify unique response dimensions and significant factors driving model behavior. Created multidimensional animations to visualize spatiotemporal variation in resource usage and load for cores, disks, memory and network interfaces in clouds with up to O(10**5) nodes.
May 2010 NIST Special Publication 500-282: Study of Proposed Internet Congestion Control Mechanisms
Sep 2009 Draft NIST Special Publication: Study of Proposed Internet Congestion-Control Mechanisms
Apr 2009 Demonstrated applicability of Markov model perturbation analysis to communication networks.
Sep 2008 Developed a Markov model for a global, computational grid and demonstrated the feasibility of applying perturbation analysis to predict conditions that could lead to performance degradation. Currently, perturbation analysis is a theoretical topic for which we show applications to large distributed systems.
Aug 2008 Developed and demonstrated multidimensional visualization software to explore relationships among complex data sets derived from simulations of large distributed systems. Currently, there are no widely used visualization techniques to explore multidimensional data from simulations of large distributed systems.
Jun 2008 Developed and demonstrated an analytical framework to understand relationships among pricing, admission control and scheduling for resource allocation in computing clusters. Currently, resource-allocation mechanisms for computing clusters rely on heuristics.
Apr 2008 Developed and validated MesoNetHS, which adds six proposed replacement congestion-control algorithms to MesoNet and allows the behavior of the algorithms to be investigated in a large topology. Currently, these congestion-control algorithms are explored in simulated and empirical topologies of small size.
Sep 2007 Developed and demonstrated a methodology for sensitivity analysis of models of large distributed systems. Currently, sensitivity analysis of models for large distributed systems is considered infeasible.
Apr 2007 Developed and verified MesoNet, a mesoscopic scale network simulation model that can be specified with about 20 parameters. Currently, specifying most network simulations requires hundreds to thousands of parameters.
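The Apr 2018 milestone above names variance and autocorrelation among the run-time predictors that were evaluated. As a rough sketch of what such predictors compute, the following applies each to a sliding window over a measured series and raises an alert when a threshold is exceeded. The window width, thresholds and sample series are assumptions, and the milestone itself reports that these two predictors gave many false alerts under steady load.

```python
def variance(window):
    """Population variance of the values in a window."""
    m = sum(window) / len(window)
    return sum((x - m) ** 2 for x in window) / len(window)


def lag1_autocorrelation(window):
    """Lag-1 autocorrelation of a window; returns 0.0 for a constant window."""
    m = sum(window) / len(window)
    denom = sum((x - m) ** 2 for x in window)
    if denom == 0:
        return 0.0
    num = sum((window[i] - m) * (window[i + 1] - m) for i in range(len(window) - 1))
    return num / denom


def sliding_alerts(series, width, predictor, threshold):
    """Window end positions at which the predictor value exceeds the threshold."""
    return [
        end
        for end in range(width, len(series) + 1)
        if predictor(series[end - width:end]) > threshold
    ]


# Hypothetical queue-length samples: steady at first, then growing.
series = [5, 5, 5, 5, 5, 6, 8, 11, 15, 20]
alerts = sliding_alerts(series, width=4, predictor=variance, threshold=1.0)
print("variance alerts at window ends:", alerts)
print("lag-1 autocorrelation of a ramp:", lag1_autocorrelation([1, 2, 3, 4, 5]))
```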
Related Publications
J. Xie, Y. Wan, K. Mills, J. Filliben, Y. Lei and Z. Lin, "M-PCM-OFFD: An effective output statistics estimation method for systems of high dimensional uncertainties subject to low-order parameter interactions", Mathematics and Computers in Simulation, 159 (2019) 93-118. https://doi.org/10.1016/j.matcom.2018.10.010
V.S. Mai, A. Battou and K. Mills, "Distributed Algorithm for Suppressing Epidemic Spread in Networks", (to appear) in IEEE Control Systems Letters, 2(3), 2018.
C. Dabrowski and K. Mills, "Evaluating Predictors of Congestion Collapse in Communication Networks", Proceedings of the 2018 IEEE/IFIP Network Operations and Management Symposium, April 24-26, 2018.
J. Xie, C. He, Y. Wan, K. Mills, C. Dabrowski, "A Layered and Aggregated Queuing Network Simulator for Detection of Abnormalities", Proceedings of the Winter Simulation Conference, December 2017.
J. Xie, Y. Wan, K. Mills, J. J. Filliben, F. L. Lewis, "A Scalable Sampling Method to High-dimensional Uncertainties for Optimal and Reinforcement Learning-based Controls", IEEE Control Systems Letters, Volume 1, No. 1, Pages 98-103, July 2017. Online ISSN 2475-1456. DOI: 10.1109/LCSYS.2017.2708598
C. Dabrowski and K. Mills, "Using Realistic Factors to Simulate Catastrophic Congestion Events in a Network", Computer Communications 112 (2017) 93-108. DOI: 10.1016/j.comcom.2017.08.006