The Configurable Data Curation System (CDCS) is an informatics platform created as critical data infrastructure for materials science R&D. Although initially conceived under the Materials Genome Initiative (MGI) program to accelerate the innovation, design, and deployment of advanced materials, the CDCS is increasingly used by scientific projects, organizations, and institutions in other domains, both domestically and internationally.
The ability to automate and accelerate the activities of scientific and engineering lifecycles for materials science (or any other domain) depends critically on a scalable infrastructure for scientific data. Without appropriate data, or without connections between those data, no meaningful automation or interpretation is possible. Within the MGI, data collections are often mutually incompatible and represented in diverse formats, which is a challenge to the distributed research goal envisaged by the MGI. The CDCS allows materials data to be curated into a repository using predefined templates. Because the platform’s underlying XML format can be transformed into virtually any other format using standard tools, the CDCS can serve as a data source for a wide variety of existing materials informatics efforts that span projects, groups, and organizations. Each project, group, or organization can run as many CDCS instances as needed. Individual CDCS repositories can be interconnected for federated searches and data sharing.
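As a rough illustration of transforming XML with standard tools, the sketch below converts a small XML record to JSON using only the Python standard library. The record, its element names, and the helper functions `element_to_dict` and `xml_to_json` are hypothetical and chosen for illustration; they are not taken from any CDCS template or from the CDCS code base.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: element names are illustrative, not a CDCS template.
RECORD_XML = """<material>
  <name>Ti-6Al-4V</name>
  <property>
    <type>density</type>
    <value unit="g/cm3">4.43</value>
  </property>
</material>"""


def element_to_dict(element):
    """Recursively convert an XML element into plain Python data."""
    children = list(element)
    if not children:
        text = (element.text or "").strip()
        # Leaf elements keep their attributes alongside their text.
        if element.attrib:
            return {"attributes": dict(element.attrib), "text": text}
        return text
    # Simplification: attributes on elements that have children are ignored.
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated sibling tags are collected into a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def xml_to_json(xml_text):
    """Parse an XML string and return an indented JSON string."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)}, indent=2)


if __name__ == "__main__":
    print(xml_to_json(RECORD_XML))
```

For schema-aware or more complex conversions, an XSLT stylesheet would be the more conventional choice; the recursive walk above is only meant to show that plain-text XML is straightforward to re-express in another format.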
The informatics platform created by the NIST informatics team is a scalable data management platform whose system types (curator and registry) represent basic building blocks of MGI infrastructure for activities involving data, computation, integration, and R&D. Built as web applications made of modular functional components, the CDCS platform has been continually realizing the data-infrastructure aspect of the MGI vision by providing a scalable basis for incrementally curating, aggregating, connecting, searching, and sharing data, resources, and infrastructure. The platform is built on a stack of modern web, data, and informatics technologies including:
The CDCS is implemented in Python, using the Django web application framework and MongoDB. It uses XML because XML is a robust, proven standard written as plain text that can be shared and converted into other formats easily. The CDCS provides a Representational State Transfer (REST) API that allows other software to interact with it directly over a network. CDCS functions are available via the API, allowing for full automation.
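To give a sense of what automation through a REST API can look like, the following is a minimal sketch that prepares an HTTP request for uploading an XML record. It rests on several assumptions that should be checked against the API documentation of the target instance: the `/rest/data/` path, the payload field names (`title`, `template`, `xml_content`), and the use of HTTP Basic authentication. The helper name `build_upload_request` is hypothetical.

```python
import base64
import json
import urllib.request


def build_upload_request(base_url, username, password,
                         title, template_id, xml_content):
    """Prepare (but do not send) a POST request that uploads one XML record.

    Assumptions: the endpoint path and payload field names below are
    illustrative and must be verified against the instance's API docs.
    """
    payload = {
        "title": title,              # assumed field name
        "template": template_id,     # assumed field name
        "xml_content": xml_content,  # assumed field name
    }
    # HTTP Basic authentication header (assumed authentication scheme).
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/rest/data/",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Against a running instance, the prepared request could be sent with `urllib.request.urlopen(request)` and the response body parsed with `json.loads`. Keeping request construction separate from sending makes the sketch easy to inspect and test without network access.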
Features and capabilities available in fielded systems include:
Common use-cases supported by the CDCS include: