
Robustness and Reliability of Grid Systems

Overview: The goal of this project is to develop measurement methods for evaluating the reliability and robustness of grid computing systems that use the emerging Open Grid Forum (OGF) standards and related specifications. Grid computing systems enable dynamic composition of large numbers of distributed resources to perform highly compute-intensive tasks. These resources include processors, software components, memory and disk storage, high-speed data transfer capabilities, and databases. In recent years, there has been rapid growth in the number of industrial grid systems developed to support applications such as electronic commerce and finance, engineering design, product development, and scientific research. As demand grows and development of this technology continues, industrial grid systems that implement standards-based solutions will require methods to measure, analyze, and manage ever larger numbers of grid resources in order to ensure system reliability under volatile and uncertain conditions.

Industry Need Addressed: The rapid growth of commercial grid computing depends upon the success of standards currently being developed by the OGF and similar organizations. While these standards focus on providing a common platform for executing core grid resource management functions, less attention has been devoted to understanding how standards-based grid systems might behave at large scale, or how well they might respond under volatile and uncertain conditions. At large scales, interactions among grid components can lead to complex, non-linear behaviors that produce unexpected and uncontrolled system-wide effects, which, in turn, can severely degrade the effectiveness of an industrial grid system. Grid systems must remain reliable and robust in the face of such conditions; otherwise, significant productivity losses are likely to occur and long-term technological progress will be endangered. Ensuring the required reliability of large-scale grid systems therefore calls for measurement methods that support their analysis and management.

NIST/ITL Role: As industry focuses on establishing basic grid capabilities, NIST/ITL assists the private sector by developing methods to measure system behaviors and characteristics that impact reliability and robustness in large-scale, standards-based grid computing systems. These methods measure the ability of grids to provide services to industrial applications in the face of volatile and uncertain conditions. The methods will also enable understanding of the causes of complex behavior in grid systems, detection of the onset of undesirable system states, and ultimately control to promote desirable system states. Specifically, the work provides:

  1. A simulation framework for modeling different approaches to resource management and control that currently underlie grid computing specifications,
  2. A set of metrics, scenarios, and methods against which to measure and evaluate robustness and reliability of the proposed approaches, with sample evaluations,
  3. Control algorithms that facilitate desirable overall grid system behaviors, and
  4. Identification of issues and requirements for grid system reliability, developed through the OGF Reliability and Robustness Research Group.

Impact: NIST/ITL, through publication, interaction with industry, and participation in standards bodies, provides critical information to developers of commercial grid computing standards and applications. Such information should suggest new approaches to improving reliability and robustness under volatile conditions.