Towards a Hybrid Human-Computer Scientific Information Extraction Pipeline

Published

Author(s)

Roselyne B. Tchoua, Kyle Chard, Debra Audus, Logan T. Ward, Joshua Lequieu, Juan J. de Pablo, Ian Foster

Abstract

The emerging field of materials informatics has the potential to greatly reduce time-to-market and development costs for new materials. The success of such efforts hinges on access to large, high-quality databases of material properties. However, many such data are only to be found encoded in text within esoteric scientific articles, a situation that makes automated extraction difficult and manual extraction time-consuming and error-prone. To address this challenge, we present a hybrid Information Extraction (IE) pipeline to improve the machine-human partnership with respect to extraction quality and person-hours, through a combination of rule-based, machine learning, and crowdsourcing approaches. Our goal is to leverage computer and human strengths to alleviate the burden on human curators by automating initial extraction tasks before prioritizing and assigning specialized curation tasks to humans with different levels of training: using non-experts for straightforward tasks such as validation of higher-accuracy results (e.g., completing partial facts) and domain experts for low-certainty results (e.g., reviewing specialized compound labels). To validate our approaches, we focus on the task of extracting the glass transition temperature of polymers from published articles. Applying our approaches to 6,090 articles, we have so far extracted 259 refined data values. We project that this number will grow considerably as we tune our methods and process more articles, to exceed that found in standard, expert-curated polymer data handbooks while also being easier to keep up-to-date. The freely available data can be found on our Polymer Properties Predictor and Database website at http://pppdb.uchicago.edu.
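The task assignment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Candidate` record, the `assign` function, the 0.5 threshold, and the task names are all hypothetical, chosen only to show how automatically extracted results might be routed to non-experts or domain experts by extraction confidence.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Candidate:
    """One automatically extracted fact (hypothetical record layout)."""
    polymer: Optional[str]     # compound label, None if the extractor missed it
    tg_value: Optional[float]  # glass transition temperature, None if missing
    confidence: float          # extractor's score in [0, 1] (assumed scale)


# Hypothetical cutoff; the abstract does not state any threshold.
EXPERT_THRESHOLD = 0.5


def assign(candidate: Candidate,
           threshold: float = EXPERT_THRESHOLD) -> Tuple[str, str]:
    """Return (reviewer pool, task) for one candidate.

    Low-certainty results go to domain experts for review. Higher-accuracy
    results go to non-experts, who either complete a partial fact or
    validate a complete one.
    """
    if candidate.confidence < threshold:
        return ("expert", "review")
    if candidate.polymer is None or candidate.tg_value is None:
        return ("non_expert", "complete")
    return ("non_expert", "validate")


# Illustrative placeholder values only, not data from the paper.
complete_fact = Candidate("polymer A", 100.0, 0.9)
partial_fact = Candidate("polymer A", None, 0.9)
uncertain_fact = Candidate("polymer B", 105.0, 0.2)

for c in (complete_fact, partial_fact, uncertain_fact):
    print(c, "->", assign(c))
```

A single threshold keeps the sketch short; a real pipeline would likely also prioritize the resulting queue, which the abstract mentions but does not specify.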
Proceedings Title
2017 IEEE 13th International Conference on e-Science (e-Science)
Conference Dates
October 24-27, 2017
Conference Location
Auckland, NZ
Conference Title
2017 IEEE 13th International Conference on e-Science

Citation

Tchoua, R., Chard, K., Audus, D., Ward, L., Lequieu, J., de Pablo, J. and Foster, I. (2017), Towards a Hybrid Human-Computer Scientific Information Extraction Pipeline, 2017 IEEE 13th International Conference on e-Science (e-Science), Auckland, NZ, [online], https://doi.org/10.1109/eScience.2017.23, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=923676 (Accessed April 19, 2024)
Created November 26, 2017, Updated January 4, 2022