Classification of Scientific Journal Articles for the NIST Thermodynamics Research Center

Author(s)

Alden A. Dima, Kenneth G. Kroenlein, Sharief S. Youssef, Yuanyuan Feng

Abstract

This work explores the feasibility of applying open-source document classification techniques, with features generated by topic modeling, to scientific journal articles for which ground truth on the relevance of individual articles is available. This is done in the context of the data curation effort associated with the National Institute of Standards and Technology (NIST) Thermodynamics Research Center (TRC). The TRC generates property recommendations for scientific research and industry through automated analysis of a continuously growing archive of experimental data culled from the literature. Its document classification and curation processes are largely manual and require that each journal article be examined and evaluated for relevance and to extract key data. Automated classification of scientific journal articles could substantially reduce this human burden. We present the results of an experiment that followed a five-factor, full factorial design. The five factors were the corpus size, the topic modeling technique, the number of topics, the use of attribute filtering, and the classifier. The results support our hypothesis that it is possible to build a classifier that uses the document topic mixture vector generated by topic modeling as its input features to classify scientific journal articles as relevant to the TRC data curation efforts. The top five percent of the evaluation runs had an average F score of 0.850 and an 84.7% correct classification rate. The best result was obtained with an AdaBoost M1/J48 decision tree classifier combination that used 100-topic LDA-based document topic mixture vectors as the input. It had an 86.3% correct classification rate and an F score of 0.864. The choice of classifier was the most significant factor in determining the overall classification results. The choice of topic modeling technique was the second most dominant factor, followed by the number of topics used during topic modeling. The corpus size and th
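As an illustration of the five-factor, full factorial design named in the abstract, the following minimal Python sketch enumerates every combination of factor levels. The factor names follow the abstract, but the levels are hypothetical placeholders: the abstract mentions only LDA, 100 topics, and AdaBoost M1/J48, and does not list the remaining levels. The helper `full_factorial` is likewise an illustrative name, not part of the authors' tooling.

```python
from itertools import product

# Factor names come from the abstract; the levels are ASSUMED for illustration.
# Only "LDA", 100 topics, and "AdaBoost M1/J48" are named in the source.
FACTORS = {
    "corpus_size": ["small", "large"],             # hypothetical levels
    "topic_model": ["LDA", "other"],               # "other" is a placeholder
    "num_topics": [50, 100],                       # 50 is a placeholder
    "attribute_filtering": [False, True],
    "classifier": ["AdaBoostM1+J48", "other"],     # "other" is a placeholder
}


def full_factorial(factors):
    """Return one dict per run, covering every combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


runs = full_factorial(FACTORS)
print(len(runs))  # number of runs = product of the level counts
```

With two assumed levels per factor, the design has 2^5 = 32 runs. A full factorial design grows multiplicatively with the number of levels, which is what allows the relative influence of each factor to be compared across all combinations.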
Citation
Journal of Research (NIST JRES)
Created June 7, 2017, Updated November 14, 2018