An important part of the Materials Genome Initiative (MGI) effort is providing infrastructure and tools to enable reproducible research in computational materials science. For reproducible research to become widely used, repeatable human-based processes need to be replaced by automated open-source logging tools. This is especially the case for simulation management, which is often poorly documented and recorded during the development stages of a research project. A good practice is to use a dedicated simulation management tool (SMT) throughout the development process rather than creating an ad hoc simulation management scheme. Listed below are a number of requirements for an effective SMT.
To share provenance data effectively, the SMT must be open source, which removes any license issues that could limit reviewers' and collaborators' access to the records.
Ideally, the logging and recording process is entirely automated, with the only researcher contribution being a short "commit message" that logs the researcher's thoughts, reasons and outcomes for running the simulation.
Integrated with Version Control
The SMT should be fully integrated with, and aware of, the common distributed version control (DVC) tools such as Git, Bazaar and Mercurial. The provenance data and simulation data should not be recorded by the version control system; only the SMT project data should be held in version control.
Distributed Simulation Management
Ideally, the SMT will be able to clone and sync its records in a manner analogous to DVC tools. This is especially important for linking with continuous integration tools such as Buildbot.
Support for multiple databases
The SMT should support the most common databases (such as Postgres or SQLite) for maintaining provenance records, and should use a high-level tool such as Django for accessing the databases.
A lightweight web interface is required for managing, viewing and sharing simulation records.
The SMT should have a high level interface for querying and pulling out records based on parameter values.
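As an illustration of what querying records by parameter values could look like, here is a minimal sketch in Python. The record layout (a dictionary with "label" and "parameters" keys) and the function name query_records are assumptions made for this example; they are not taken from any particular SMT.

```python
def query_records(records, **criteria):
    """Return the records whose parameters match every given key/value pair.

    `records` is assumed to be a list of dicts, each with a "parameters" dict.
    """
    matches = []
    for record in records:
        params = record["parameters"]
        if all(key in params and params[key] == value
               for key, value in criteria.items()):
            matches.append(record)
    return matches


# Hypothetical records, invented purely for illustration.
records = [
    {"label": "run-001", "parameters": {"dt": 0.01, "steps": 100}},
    {"label": "run-002", "parameters": {"dt": 0.02, "steps": 100}},
    {"label": "run-003", "parameters": {"dt": 0.01, "steps": 500}},
]

selected = query_records(records, dt=0.01, steps=100)
print([record["label"] for record in selected])  # prints ['run-001']
```

A real SMT would more likely run such a query against its database than against an in-memory list, but the interface idea is the same: the researcher names parameter values and gets back the matching records.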
A high level API is required for customizing the management process when the SMT requirements do not meet the researcher's needs.
Hash data files
Output data files should be hashed to enable effective replication and future regression testing with a continuous integration tool.
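The hashing requirement can be sketched with Python's standard hashlib module. The function names hash_file and hash_outputs, the choice of SHA-256, and the idea of storing one digest per relative path are assumptions for this example, not a description of how any specific SMT does it.

```python
import hashlib
from pathlib import Path


def hash_file(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_outputs(directory):
    """Map each file under `directory` (by relative path) to its digest."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): hash_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Storing the resulting dictionary with a provenance record would let a later rerun be checked by recomputing the digests and comparing them. Bitwise comparison is strict: outputs that differ only by floating-point round-off or embedded time stamps would be flagged as different.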
Integrate low level tests
The SMT should make it possible to integrate low-level regression tests with each provenance record at low overhead.
All dependencies should be recorded automatically, including uninstalled development repositories that the simulation depends on. This is hard to achieve when a simulation spans multiple programming languages, but it is one of the most important requirements.
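For the Python side of a simulation, part of the dependency information can be captured with the standard library alone. The sketch below is an assumption-laden example: the function name capture_environment and the layout of the returned dictionary are invented here, and it covers only packages installed in the current Python environment. It does not detect uninstalled development repositories or dependencies written in other languages.

```python
import platform
import sys
from importlib import metadata


def capture_environment():
    """Collect the interpreter version, platform and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "executable": sys.executable,
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    env = capture_environment()
    print(env["python_version"], len(env["packages"]), "packages recorded")
```

Calling such a function at the start of every run, and storing the result with the record, is one way to keep the process automatic rather than dependent on the researcher's memory.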
The SMT should be aware of the status of live jobs.
Concurrent access to the database is required for batch jobs.
The SMT needs to be aware of provenance data associated with parallel jobs (such as which nodes are being used) as well as awareness of various queuing systems.
It should be easy to output record tables in LaTeX and in markup languages such as HTML, reStructuredText and Markdown for inclusion in blogs, electronic notebooks and other documents.
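As a sketch of what table export might involve, the following Python function renders a list of records as a Markdown table. The function name records_to_markdown and the flat dictionary layout of each record are assumptions for this example; exporters for the other formats would follow the same pattern with different delimiters.

```python
def records_to_markdown(records, columns):
    """Render records (a list of dicts) as a Markdown table with the given columns."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    rows = [
        "| " + " | ".join(str(record.get(column, "")) for column in columns) + " |"
        for record in records
    ]
    return "\n".join([header, divider] + rows)


# Hypothetical records, invented purely for illustration.
example = [
    {"label": "run-001", "dt": 0.01, "outcome": "converged"},
    {"label": "run-002", "dt": 0.02},
]
print(records_to_markdown(example, ["label", "dt", "outcome"]))
```

Missing values are rendered as empty cells so that records with different parameter sets can share a table.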
The SMT should make it easy, with low overhead, to specify which records to upload to a continuous integration environment such as Buildbot.
Other Provenance Data
Every record (simulation) should have a unique ID and an associated time stamp.
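A minimal sketch of a record that carries a unique ID and a time stamp, using only the Python standard library, is shown here. The class name SimulationRecord and its fields are assumptions for this example and do not reflect the data model of any particular SMT.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class SimulationRecord:
    """One simulation run, identified by a random UUID and a UTC time stamp."""
    reason: str
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


record = SimulationRecord(reason="test a smaller time step")
print(record.record_id, record.timestamp.isoformat())
```

Generating the ID and time stamp inside the record, rather than asking the researcher for them, keeps this part of the provenance data automatic. A timezone-aware UTC time stamp avoids ambiguity when records from different machines are merged.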
One particular SMT that is currently being evaluated is Sumatra. It is a lightweight system for recording the history and provenance data of numerical simulations. It works particularly well for scientists who are in the intermediate stage between developing a code base and using that code base for active research. This is a common scenario, and it often results in a mode of development that uses branching in the version control system for both code development and production simulations. Using Sumatra avoids this unintended use of the versioning system by providing a lightweight design for recording the provenance data independently of the versioning system used for code development. The lightweight design of Sumatra fits well with existing ad hoc patterns of simulation management, in contrast with more pervasive workflow tools, which can require a wholesale alteration of work patterns. Sumatra uses a straightforward Django-based data model, enabling persistent data storage independent of the Sumatra installation. Sumatra provides a command-line utility with a rudimentary web interface, but it has the potential to become a full web-based simulation management solution.
Lead Organizational Unit: MML
Related Programs and Projects:
This project supports NIST's efforts in the Materials Genome Initiative
Complete Automation and Distribution of Parallel Simulation Tasks