Towards a Hybrid Human-Computer Scientific Information Extraction Pipeline

Roselyne B. Tchoua; Kyle Chard; Debra Audus; Logan T. Ward; Joshua Lequieu; Juan J. de Pablo; Ian Foster

Published

November 27, 2017

Abstract

The emerging field of materials informatics has the potential to greatly reduce time-to-market and development costs for new materials. The success of such efforts hinges on access to large, high-quality databases of material properties. However, many such data are found only encoded in text within esoteric scientific articles, a situation that makes automated extraction difficult and manual extraction time-consuming and error-prone. To address this challenge, we present a hybrid Information Extraction (IE) pipeline that improves the machine-human partnership with respect to extraction quality and person-hours through a combination of rule-based, machine learning, and crowdsourcing approaches. Our goal is to leverage computer and human strengths to alleviate the burden on human curators by automating initial extraction tasks before prioritizing and assigning specialized curation tasks to humans with different levels of training: non-experts for straightforward tasks such as validation of higher-accuracy results (e.g., completing partial facts) and domain experts for low-certainty results (e.g., reviewing specialized compound labels). To validate our approaches, we focus on the task of extracting the glass transition temperature of polymers from published articles. Applying our approaches to 6090 articles, we have so far extracted 259 refined data values. We project that this number will grow considerably as we tune our methods and process more articles, eventually exceeding the number found in standard, expert-curated polymer data handbooks while also being easier to keep up to date. The freely available data can be found on our Polymer Properties Predictor and Database website at http://pppdb.uchicago.edu.
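
The division of labor described in the abstract (automated extraction first, then routing each result to a non-expert or a domain expert according to certainty) can be illustrated with a small sketch. This is not the authors' implementation: the regular expression, the `KNOWN_POLYMERS` lexicon, the confidence scores, and the 0.5 threshold are all hypothetical choices made for illustration, and a simple rule stands in for the rule-based and machine learning components the abstract mentions.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical rule: a mention of "Tg" or "glass transition temperature",
# a short gap without digits or sentence punctuation, then a value and unit.
TG_PATTERN = re.compile(
    r"(?:\bT\s?g\b|glass transition temperature)"
    r"[^.\d-]{0,40}?"
    r"(-?\d+(?:\.\d+)?)\s*(°C|K)\b",
    re.IGNORECASE,
)

# Hypothetical lexicon of polymer names the extractor recognizes with confidence.
KNOWN_POLYMERS = ("polystyrene", "poly(methyl methacrylate)", "polyethylene")


@dataclass
class Candidate:
    sentence: str
    polymer: Optional[str]  # None when no known polymer name is in the sentence
    value: float
    unit: str
    confidence: float       # illustrative score, not a calibrated probability


def extract_candidates(text: str) -> List[Candidate]:
    """Scan sentence by sentence for glass transition temperature values."""
    candidates = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        match = TG_PATTERN.search(sentence)
        if match is None:
            continue
        lowered = sentence.lower()
        polymer = next((p for p in KNOWN_POLYMERS if p in lowered), None)
        # Assumption: a recognized polymer name makes the fact high-certainty;
        # an unrecognized compound label makes it low-certainty.
        confidence = 0.9 if polymer else 0.3
        candidates.append(
            Candidate(sentence, polymer, float(match.group(1)), match.group(2), confidence)
        )
    return candidates


def route(candidate: Candidate, threshold: float = 0.5) -> str:
    """Send higher-certainty results to non-experts and the rest to experts."""
    return "non-expert" if candidate.confidence >= threshold else "expert"


if __name__ == "__main__":
    sample = (
        "The Tg of polystyrene is 100 °C. "
        "Sample P3 showed a glass transition temperature of 378.5 K. "
        "The film was annealed overnight."
    )
    for cand in extract_candidates(sample):
        print(route(cand), cand.polymer, cand.value, cand.unit)
```

In this sketch the second sentence carries an unrecognized compound label ("Sample P3"), so it is routed to a domain expert, mirroring the abstract's example of experts reviewing specialized compound labels.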

Proceedings Title

2017 IEEE 13th International Conference on e-Science (e-Science)

Conference Dates

October 24-27, 2017

Conference Location

Auckland, NZ

Conference Title

2017 IEEE 13th International Conference on e-Science

Pub Type

Conferences

Download Paper

https://doi.org/10.1109/eScience.2017.23

Polymers and Artificial intelligence

Citation

Tchoua, R., Chard, K., Audus, D., Ward, L., Lequieu, J., de Pablo, J. and Foster, I. (2017), Towards a Hybrid Human-Computer Scientific Information Extraction Pipeline, 2017 IEEE 13th International Conference on e-Science (e-Science), Auckland, NZ, [online], https://doi.org/10.1109/eScience.2017.23, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=923676 (Accessed July 15, 2026)

Created November 26, 2017, Updated January 4, 2022
