The Materials Genome Initiative (MGI) will create a new era of materials innovation that will serve as a foundation for strengthening domestic materials-related industries. The goal of MGI is to accelerate the United States' ability to discover, develop, manufacture, and deploy advanced materials by a factor of two, at a fraction of the cost.
Sophisticated software tools and widely available high-performance computing continue to be a source of technological advancement. For materials, this has meant a dramatic improvement in modeling capabilities, supporting the notion of "materials by design."
This impressive computationally fueled growth has brought into focus the need for better data-related tools. Enormous quantities of scientific data are being continuously produced, stretching the limits of historically adequate modes of communication and collaboration. The traditional printed article is no longer adequate for disseminating scientific information: it is an insufficient medium for transmitting the results of the many interrelated computational studies that are required for just a single study in the "materials by design" paradigm.
Essential to the success of MGI is the development of a data infrastructure that will provide the needed data and tools to support this effort. The diversity of materials data requires that this data infrastructure be built to accommodate a variety of user needs and data types.
To develop the needed data infrastructure and informatics tools, we are currently focusing on three areas: data curation, ontology development, and semantic infrastructure.
- Data curation will become an increasingly important activity, because the benefits of computing to the advancement of Materials Science presuppose that the necessary data is available in a machine-processable format. To this end, we have been developing the NIST Materials Data Curation System. We are nearing the release of its first version, which will include:
  - A web-based interface to curate, search, and retrieve materials data
  - Support for federated search
  - REST-based API for remote access
  - Integration with scientific workflows
  - Support for semantic queries via a SPARQL endpoint

  Future versions will include support for scientific images and graphics and integration with registry systems.
- The scientific and technical communities are increasingly interested in ontology development. Ontologies provide a foundation for large-scale efforts by enabling semantic document processing and advanced analytics, by supporting the large-scale combination of diverse information from multiple sources, and by accelerating knowledge-intensive activities such as modeling and simulation, and design and engineering. An ontology serves these roles by containing the shared understanding of the objects, concepts, and relationships that are asserted to exist in a domain. Ontology development is similar to software development in that it becomes tractable only with an understanding of the domain in which it will be embedded. Eliciting this knowledge from domain experts is often a tedious, time-consuming, and error-prone process. Natural language is a tempting source for this semantic information, provided that the necessary concepts and relationships can be properly extracted. With a corpus of any significant size, automation is required to extract the knowledge. Such automation falls under ontology learning, which is defined as the partial automation of ontology development, including the extraction of knowledge from textual sources. We have embarked on an effort to use natural language processing (NLP) techniques to analyze a corpus of several thousand materials-related scientific articles. This effort includes the creation of a distributed system for extracting text from scientific articles in PDF via optical character recognition (OCR), with machine learning-based OCR denoising, and the use of NLP tools to identify materials-related terms and relationships in the extracted text. The results of this work will support the creation of technical vocabularies for use in curating data and literature, as well as community-driven ontology development efforts.
- Other domains, notably the biomedical sciences, have had success with semantic technologies such as the Semantic Web to support the broad access and use of their data. We are investigating the creation of similar semantic infrastructure for MGI. We are exploring the feasibility of using the National Library of Medicine's Semantic Medline with materials literature. Semantic Medline allows its users to see literature search results as an interconnected graph of knowledge-based relationships and to uncover otherwise hidden connections. Such a capability would be extremely helpful in supporting the MGI goals. We are also working to develop applications that present scientific data in terms of the Semantic Web. Such applications would allow disparate sources of information to be combined following the Linked Open Data paradigm.
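
As a rough illustration of the Linked Open Data idea mentioned in the last item, the sketch below merges two small sets of subject-predicate-object triples that share an identifier and then answers a question that neither source could answer alone. It is not the project's implementation. The namespace `http://example.org/materials/`, the predicate names, and all record values are hypothetical placeholders, and a production system would more plausibly use an RDF store queried through SPARQL than the in-memory pattern matcher used here.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace and predicates, invented for this illustration.
EX = "http://example.org/materials/"
MENTIONS = EX + "mentions"
HAS_COMPOSITION = EX + "hasComposition"
HAS_TITLE = EX + "hasTitle"

# Source A: placeholder record from a curated data collection.
source_a: Set[Triple] = {
    (EX + "alloy-1", HAS_COMPOSITION, "Al-Cu"),
}

# Source B: placeholder record from a literature index.
source_b: Set[Triple] = {
    (EX + "paper-7", MENTIONS, EX + "alloy-1"),
    (EX + "paper-7", HAS_TITLE, "Placeholder title"),
}


def match(graph: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def papers_with_composition(graph: Set[Triple]) -> List[Tuple[str, str]]:
    """Join 'mentions' links to composition facts via the shared material URI."""
    results = []
    for paper, _, material in match(graph, p=MENTIONS):
        for _, _, composition in match(graph, s=material, p=HAS_COMPOSITION):
            results.append((paper, composition))
    return sorted(results)


# Because both sources use the same URI for the material, merging the
# graphs is a plain set union; no schema mapping is needed.
merged = source_a | source_b

for paper, composition in papers_with_composition(merged):
    print(paper, "discusses a material with composition", composition)
```

The join succeeds only because both sources refer to the material by the same URI, which is the property that shared vocabularies and ontologies are meant to guarantee.
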
Lead Organizational Unit:
Alden Dima, Information Systems Group