Creating Training Data for Scientific Named Entity Recognition with Minimal Human Effort

Roselyne B. Tchoua; Aswathy Ajith; Zhi Hong; Logan T. Ward; Kyle Chard; Debra Audus; Shrayesh N. Patel; Juan J. de Pablo; Ian Foster

Published

June 8, 2019

Abstract

Scientific Named Entity Referent Extraction is often more complicated than traditional Named Entity Recognition (NER). For example, in polymer science, chemical structure may be encoded in a variety of nonstandard naming conventions, and authors may refer to polymers with conventional names, commonly used names, labels (in lieu of longer names), synonyms, and acronyms. As a result, accurate scientific NER methods are often based on task-specific rules, which are difficult to develop and maintain, and are not easily generalized to other tasks and fields. Machine learning models require substantial expert-annotated data for training. Here we propose polyNER: a semi-automated system for efficient identification of scientific entities in text. PolyNER applies word embedding models to generate entity-rich corpora for productive expert labeling, and then uses the resulting labeled data to bootstrap a context-based word vector classifier. Evaluation on materials science publications shows that the polyNER approach enables improved precision or recall relative to a state-of-the-art chemical entity extraction system at a dramatically lower cost: it required just two hours of expert time, rather than extensive and expensive rule engineering, to achieve that result. This result highlights the potential for human-computer partnership for constructing domain-specific scientific NER systems.
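
The two-stage workflow described in the abstract (embedding-based candidate generation, then a classifier bootstrapped from expert labels) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy three-dimensional vectors, the function names `rank_candidates` and `train_centroid_classifier`, and the use of a nearest-centroid classifier in place of polyNER's context-based word vector classifier are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def rank_candidates(embeddings, seeds, top_k):
    """Stage 1 (hypothetical): rank non-seed words by similarity to the
    centroid of a few known entity names, so the expert reviews an
    entity-rich list instead of raw text."""
    seed_center = centroid([embeddings[s] for s in seeds])
    scored = [(cosine(vec, seed_center), word)
              for word, vec in embeddings.items() if word not in seeds]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_k]]

def train_centroid_classifier(embeddings, labels):
    """Stage 2 (hypothetical stand-in): nearest-centroid classifier built
    from expert labels; returns a function mapping a vector to True/False."""
    pos = centroid([embeddings[w] for w, is_entity in labels.items() if is_entity])
    neg = centroid([embeddings[w] for w, is_entity in labels.items() if not is_entity])
    return lambda vec: cosine(vec, pos) > cosine(vec, neg)

# Toy vectors invented for this sketch; real embeddings would come from a
# word embedding model trained on the publication corpus.
TOY_EMBEDDINGS = {
    "polystyrene": [0.9, 0.1, 0.0],
    "polyethylene": [0.8, 0.2, 0.1],
    "PMMA": [0.85, 0.15, 0.05],
    "PS-b-PMMA": [0.7, 0.3, 0.1],
    "temperature": [0.1, 0.9, 0.2],
    "solvent": [0.2, 0.8, 0.3],
    "measured": [0.0, 0.2, 0.9],
}

seeds = {"polystyrene", "polyethylene"}
candidates = rank_candidates(TOY_EMBEDDINGS, seeds, top_k=3)

# Simulated expert labels for the seeds and the ranked candidates.
expert_labels = {
    "polystyrene": True, "polyethylene": True,
    "PMMA": True, "PS-b-PMMA": True, "solvent": False,
}
predict = train_centroid_classifier(TOY_EMBEDDINGS, expert_labels)
```

In this toy setup, `candidates` surfaces the acronym and the block-copolymer label ahead of ordinary vocabulary, which is the sense in which the candidate list is "entity-rich". The returned `predict` function can then be applied to vectors of words the expert never saw.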

Proceedings Title

Computational Science ICCS 2019

Conference Dates

October 29-November 1, 2018

Conference Location

Algarve, PT

Pub Type

Conferences

Polymers and Natural language processing

Citation

Tchoua, R., Ajith, A., Hong, Z., Ward, L., Chard, K., Audus, D., Patel, S., de Pablo, J. and Foster, I. (2019), Creating Training Data for Scientific Named Entity Recognition with Minimal Human Effort, Computational Science ICCS 2019, Algarve, PT, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=926228 (Accessed July 17, 2026)

Created June 7, 2019, Updated October 12, 2021
