New resources developed by a multidisciplinary team of experts in natural language processing and data curation and discovery allow researchers to get the most from the COVID-19 Open Research Dataset (CORD-19), a collection of scientific literature about coronaviruses containing tens of thousands of items.
The NIST Scientific Indexing Resource uses the NIST-developed “root and rule” method to identify and index important words (the “roots”) in the dataset, then looks at the relationships between those words and adjacent words using the semantic norms of a language (the “rules”). This method can determine keywords and link words into phrases and concepts to help a user find relevant and related articles without needing to search for a precisely matched keyword or phrase.
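The general idea of pairing "roots" with "rules" can be illustrated with a toy sketch. This is not the NIST implementation, whose details are not described here. It assumes that roots are simply the most frequent non-stopword tokens and that the only rule is "join a root with the word immediately before it unless that word is a stopword." The stopword list and the function names `tokenize`, `find_roots`, and `link_phrases` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "with", "by", "on"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping hyphenated terms whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def find_roots(docs, top_n=5):
    """Treat the top_n most frequent non-stopword tokens as 'roots' (a simplifying assumption)."""
    counts = Counter(
        tok for doc in docs for tok in tokenize(doc) if tok not in STOPWORDS
    )
    return {word for word, _ in counts.most_common(top_n)}


def link_phrases(text, roots):
    """Apply one toy 'rule': join each root with the preceding word when it is not a stopword."""
    tokens = tokenize(text)
    phrases = []
    for prev, cur in zip(tokens, tokens[1:]):
        if cur in roots and prev not in STOPWORDS:
            phrases.append(f"{prev} {cur}")
    return phrases
```

Calling `link_phrases("The spike protein binds to the receptor", {"protein", "receptor"})` links "spike" to the root "protein" but skips "the receptor", because "the" is a stopword. A production indexer would presumably use far richer linguistic rules than this single adjacency check.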
The COVID-19 Data Repository relies on the Configurable Data Curation System developed at NIST for structuring datasets that lack organization. The Curator, as it is known, makes the CORD-19 dataset searchable by author, institution, and keyword and allows users to add terms to progressively filter a search. In addition, artificial intelligence researchers can query the repository’s raw text without writing their own code, download the complete CORD-19 dataset to run their own algorithms, or use the application programming interface to write new programs.
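Progressive filtering of this kind can be sketched in a few lines. The example below does not use the repository's actual interface or data model. The `Record` fields, the `matches` and `progressive_filter` helpers, and the sample records are all invented for illustration. Each added term narrows the previous result set rather than starting a new search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical record layout; the real repository schema may differ."""
    title: str
    authors: tuple
    institution: str
    keywords: tuple


def matches(record, term):
    """Return True if the term appears (case-insensitively) in any searchable field."""
    term = term.lower()
    fields = (record.title, record.institution, *record.authors, *record.keywords)
    return any(term in field.lower() for field in fields)


def progressive_filter(records, terms):
    """Apply the terms one after another, each narrowing the previous results."""
    results = list(records)
    for term in terms:
        results = [r for r in results if matches(r, term)]
    return results


# Made-up placeholder records, not real CORD-19 entries.
SAMPLE = [
    Record("Spike protein structure", ("Author A",), "Institute X", ("spike", "protein")),
    Record("Vaccine trial design", ("Author B",), "Institute X", ("vaccine",)),
    Record("Protein folding", ("Author A",), "Institute Y", ("protein",)),
]
```

With these placeholder records, `progressive_filter(SAMPLE, ["protein"])` keeps the first and third records, and adding a second term, `["protein", "institute x"]`, narrows the result to the first record only.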
The COVID-19 Registry, also based on the Configurable Data Curation System, is a web application that collects descriptions of resources including other repositories, databases, services, portals, websites, and organizations. Research community members can contribute to the resource registry using a web form or an application programming interface, and resources can be harvested from other registries. It has the potential to develop into a nationwide, comprehensive registry of COVID-19-related resources.
The cord19-cdcs-nist repository, hosted on GitHub, provides quick access to CORD-19 data that has been screened to remove incomplete, irrelevant, or corrupt entries and is therefore ready for analysis with any programming language. The companion cv-py collection provides integration with data analysis tools written in the Python programming language that work well with the CORD-19 dataset.
The CORD-19 dataset was made available by the Allen Institute for AI, Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft, and the National Library of Medicine at the National Institutes of Health at the request of the White House Office of Science and Technology Policy. These NIST products will be refreshed as the CORD-19 dataset grows. Development of these tools was made possible by NIST expertise gained through work on bioinformatics, scalable computing, data curation and discovery, and the Materials Genome Initiative, which accelerates the development and discovery of new materials through data mining and modeling.