To develop the needed data infrastructure and informatics tools, we are currently focusing on three areas: data curation, ontology development, and semantic infrastructure.
- Data curation will become an increasingly important activity because the benefits of computing for the advancement of Materials Science presuppose that the necessary data are available in a machine-processable format. To this end, we have been developing the NIST Materials Data Curation System. We are nearing the release of its first version, which will include:
  - A web-based interface to curate, search, and retrieve materials data
  - Support for federated search
  - REST-based API for remote access
  - Integration with scientific workflows
  - Support for semantic queries via a SPARQL endpoint
Future versions will include support for scientific images and graphics and integration with registry systems.
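
As an illustration of what remote semantic access could look like, the following minimal sketch builds (but does not send) a query request in the style of the SPARQL 1.1 Protocol, which passes the query text in a URL-encoded `query` parameter. The endpoint URL, the `ex:` vocabulary, and the function name `build_sparql_request` are assumptions made for this example; the text does not specify the actual interface of the curation system.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: the real system's URL is not given in the text.
ENDPOINT = "https://curator.example.org/sparql"

# Hypothetical vocabulary: the ex: prefix and its terms are placeholders.
QUERY = """
PREFIX ex: <http://example.org/materials#>
SELECT ?material ?value WHERE {
  ?material ex:hasProperty ?p .
  ?p ex:name "density" ; ex:value ?value .
}
LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a GET request carrying the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SPARQL SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it. That step is left out here because the endpoint is a placeholder.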
- The scientific and technical communities are increasingly interested in ontology development because ontologies provide a foundation for large-scale efforts: they enable semantic document processing and advanced analytics, support the large-scale combination of diverse information from multiple sources, and accelerate knowledge-intensive activities such as modeling and simulation and design and engineering. An ontology serves these roles by capturing a shared understanding of the objects, concepts, and relationships that are asserted to exist in a domain. Ontology development is similar to software development in that it becomes tractable only with an understanding of the domain in which it will be embedded. Eliciting this knowledge from domain experts is often a tedious, time-consuming, and error-prone process. Natural-language text is a tempting source of this semantic information, provided that the necessary concepts and relationships can be properly extracted. For a corpus of any significant size, automation is required to extract the knowledge. Such automation falls under ontology learning, which is defined as the partial automation of ontology development, including the extraction of knowledge from textual sources. We have embarked on an effort to use natural language processing (NLP) techniques to analyze a corpus of several thousand materials-related scientific articles. This effort includes creating a distributed system that extracts text from scientific articles in PDF via optical character recognition (OCR), with machine-learning-based OCR denoising, and using NLP tools to identify materials-related terms and relationships in the extracted text. The results of this work will support the creation of technical vocabularies for use in curating data and literature, as well as community-driven ontology development efforts.
- Other domains, notably the biomedical sciences, have had success with semantic technologies such as the Semantic Web in supporting broad access to and use of their data. We are investigating the creation of a similar semantic infrastructure for MGI. We are exploring the feasibility of using the National Library of Medicine's Semantic Medline with materials literature. Semantic Medline allows its users to see literature search results as an interconnected graph of knowledge-based relationships and to uncover otherwise hidden connections. Such a capability would be extremely helpful in supporting the MGI goals. We are also working to develop applications that present scientific data in terms of the Semantic Web. Such applications would allow disparate sources of information to be combined following the Linked Open Data paradigm.
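
The idea of combining disparate sources under the Linked Open Data paradigm can be sketched in a few lines. In this toy example, each source is a set of subject-predicate-object triples, and two sources can be merged by set union because they identify the same material with the same URI. All URIs and values are made up for illustration, and the helper names `merge_graphs` and `describe` are hypothetical; a real application would more likely rely on an RDF library and published vocabularies.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate URI, object)

# Illustrative data only: every URI and value below is a placeholder.
ALLOY_A = "http://example.org/material/alloy-A"
properties_db: Set[Triple] = {
    (ALLOY_A, "http://example.org/vocab#density", "4.5"),
}
literature_db: Set[Triple] = {
    (ALLOY_A, "http://example.org/vocab#mentionedIn", "http://example.org/article/1"),
    ("http://example.org/material/alloy-B",
     "http://example.org/vocab#mentionedIn", "http://example.org/article/2"),
}


def merge_graphs(*graphs: Set[Triple]) -> Set[Triple]:
    """Union the triple sets; shared URIs link the sources automatically."""
    merged: Set[Triple] = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph: Set[Triple], subject: str) -> Dict[str, Set[str]]:
    """Collect every predicate and its objects for one subject URI."""
    description: Dict[str, Set[str]] = defaultdict(set)
    for s, p, o in graph:
        if s == subject:
            description[p].add(o)
    return dict(description)


merged = merge_graphs(properties_db, literature_db)
for predicate, objects in sorted(describe(merged, ALLOY_A).items()):
    print(predicate, sorted(objects))
```

After merging, a single lookup on the alloy's URI returns both its measured property and the article that mentions it, although the two facts came from different sources.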
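
The term-identification step of the ontology-learning effort described above can also be illustrated with a deliberately simple stand-in. The sketch below counts word pairs that recur across documents and contain no stopwords, which is a crude proxy for the candidate terms that full NLP tools would identify. The function name `candidate_terms`, the stopword list, and the three sample sentences are assumptions made for this example, not part of the actual pipeline.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# A tiny illustrative stopword list; real NLP toolkits ship fuller ones.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "is", "to", "for",
             "with", "on", "by", "was", "were", "at"}


def candidate_terms(documents: Iterable[str], n: int = 2,
                    min_count: int = 2) -> List[Tuple[str, int]]:
    """Return n-grams without stopwords that occur at least min_count times."""
    counts: Counter = Counter()
    for text in documents:
        # Lowercase and keep alphabetic tokens (hyphens allowed inside).
        tokens = re.findall(r"[a-z][a-z\-]*", text.lower())
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if any(token in STOPWORDS for token in gram):
                continue
            counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]


# Made-up sentences standing in for text extracted from articles.
sample = [
    "The yield strength of the alloy increased after heat treatment.",
    "Heat treatment changed the grain size and the yield strength.",
    "Grain size was measured before heat treatment.",
]
for term, count in candidate_terms(sample):
    print(count, term)
```

Frequency counting alone misses rare terms and says nothing about the relationships between terms, which is one reason the effort relies on dedicated NLP tools.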