The Configurable Data Curation System (CDCS) is an informatics platform created as critical data infrastructure for materials science R&D. Although initially conceived under the Materials Genome Initiative (MGI) program to accelerate advanced materials innovation, design, and deployment, the CDCS is increasingly used by scientific projects, organizations, and institutions, and in other domains, both domestically and internationally.
The ability to automate and accelerate the activities of scientific and engineering lifecycles for materials science (or any other domain) depends critically on a scalable infrastructure for scientific data. Without appropriate data, or without interconnection of that data, no meaningful automation or interpretation is possible. Within the MGI, data collections are often mutually incompatible and represented in diverse formats, which is a challenge to the distributed research goal envisaged by the MGI. The CDCS allows materials data to be curated into a repository using predefined templates. Because the platform’s underlying XML format can be transformed into virtually any other format using standard tools, the CDCS can serve as a data source for a wide variety of existing materials informatics efforts that span projects, groups, and organizations. Each project, group, or organization can run as many MDCS instances as needed. Individual MDCS repositories can be interconnected for federated searches and data sharing.
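The transformation of XML records into other formats mentioned above can be illustrated with a minimal sketch that converts an XML record to JSON using only the Python standard library. The record, its element names, and the `@attributes` / `#text` key convention are assumptions made for illustration; they are not taken from any actual CDCS template.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: element names and values are illustrative only
# and do not correspond to a real CDCS template.
RECORD_XML = """\
<material>
  <name>sample-001</name>
  <property>
    <type>density</type>
    <value unit="g/cm3">1.0</value>
  </property>
</material>
"""


def element_to_dict(elem):
    """Recursively convert an element into plain dicts, lists, and strings."""
    children = list(elem)
    if not children:
        text = (elem.text or "").strip()
        if elem.attrib:
            # Leaf with attributes: keep both the attributes and the text.
            return {"@attributes": dict(elem.attrib), "#text": text}
        return text

    result = {}
    if elem.attrib:
        result["@attributes"] = dict(elem.attrib)
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tags are collected into a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def xml_to_json(xml_text):
    """Parse an XML string and return an equivalent JSON string."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)}, indent=2)


if __name__ == "__main__":
    print(xml_to_json(RECORD_XML))
```

Other standard tools, such as XSLT stylesheets, could serve the same purpose; the point is only that a template-conformant XML record can be reshaped mechanically for downstream consumers.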
The informatics platform created by the NIST informatics team is a scalable data management platform whose system types (curator and registry) represent basic building blocks of MGI infrastructure for activities involving data, computation, integration, and R&D. Built as web applications made of modular functional components, the CDCS platform has been continually realizing the data-infrastructure aspect of the MGI vision by providing a scalable basis for incrementally curating, aggregating, connecting, searching, and sharing data, resources, and infrastructure. This has been built on a stack of modern web, data, and informatics technologies including:
The CDCS is implemented in Python, using the Django web-application framework and MongoDB. It uses XML because XML is a robust, proven standard written as plain text that can be shared and converted into other formats easily. The CDCS provides a Representational State Transfer (REST) API that allows other software to interact with it directly over a network. CDCS functions are available via the API, allowing for full automation.
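As a rough sketch of what automation over a REST API of this kind might look like, the following builds (without sending) an HTTP request that would upload one XML record. The endpoint path `/rest/data/`, the payload field names, the token-based `Authorization` header, and the host name are all assumptions for illustration; the API documentation of the deployed CDCS instance is the authority on the actual routes and authentication scheme.

```python
import json
import urllib.request


def build_upload_request(base_url, token, title, template_id, xml_content):
    """Build (but do not send) a POST request for uploading one XML record.

    The endpoint path, payload keys, and auth header format are
    hypothetical and should be checked against the target instance.
    """
    payload = {
        "title": title,              # assumed field name
        "template": template_id,     # assumed field name
        "xml_content": xml_content,  # assumed field name
    }
    url = base_url.rstrip("/") + "/rest/data/"  # assumed endpoint path
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {token}",  # assumed auth scheme
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_upload_request(
        base_url="https://cdcs.example.org",  # placeholder host
        token="YOUR-API-TOKEN",
        title="sample-001.xml",
        template_id="TEMPLATE-ID",
        xml_content="<material><name>sample-001</name></material>",
    )
    print(request.get_method(), request.full_url)
```

Against a real instance, the request would be sent with `urllib.request.urlopen(request)`; it is kept unsent here so the sketch runs without network access.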
Features and capabilities available in fielded systems include:
Common use-cases supported by the CDCS include:
Since the 2017 rearchitecture into the 2.0 core modular system of packages, the team has made 31 releases of the curator (MDCS) and 24 releases of the registry (MRR) software components, along with nearly 50 modular components. Before that, there were 6 primary releases following the first public release in 2015. Releases have been developed in close collaboration with a growing base of stakeholders, who suggest important features that are rapidly developed and deployed on both test and production systems.
The user community for the CDCS has a growing footprint in the U.S. (government, industry, and academia) as well as internationally. CDCS systems are being used to support research at national research institutions in Switzerland (the Swiss Government’s Research Institute for Materials Science and Technology or EMPA), Sweden (Stockholm’s KTH Royal Institute of Technology), Taiwan (University of Taiwan), China (Shanghai University), Japan (National Institute for Materials Science or NIMS), and Korea (Korea Institute of Materials Science or KIMS). In the United States, it is being used extensively at the National Institute of Standards and Technology (NIST) for projects such as smart manufacturing, additive manufacturing, interatomic potentials research, phase-data research, high-throughput materials science, and more. In addition, it is being used for materials science R&D by the U.S. Army Research Lab, Hollings Marine Laboratory, Argonne National Laboratory, Johns Hopkins University, Duke University, Texas A&M, Missouri S&T, Northwest University, and others. It is being used at prominent industry organizations such as NextFlex, America Makes, and QuesTek, and is also showing impact in domains well beyond materials science, such as at NIH to support human genome bioinformatics research, and in the greenhouse-gases research community. Its registries also have domestic and international deployments that support scientific discovery at prominent institutions such as the world-recognized metrology organization Bureau International des Poids et Mesures (BIPM), the Research Data Alliance (RDA), and the Center for Hierarchical Materials Design (CHiMaD), as well as a primary registry deployment at NIST.
Recently we have seen rapid application, integration, and scaling of CDCS to the COVID-19 research domain. The NIST COVID19-DATA repository and registry systems are being made available to aid in meeting the White House Call to Action for the Nation’s artificial intelligence experts to develop new text and data mining techniques that can help the science community address high-priority scientific questions associated with COVID-19.
The CDCS is actively developed in collaboration with MML and ODI and has been regularly listed as part of the NIST-wide initiatives for improving access to open data and supporting the NIST process systems for developing and producing scientific reference data. Thus, the CDCS remains close to conversations surrounding high-quality science and analysis, such as scientific reproducibility and provenance.
Continual engagement is gradually growing and connecting scientific data communities. Registries and repositories are being deployed and connected into scientific workflows in materials science, the bio-economy, greenhouse gases, international metrology, and international research data working groups (RDA), and are being integrated with other existing data infrastructure.