Exploiting redundancy in large materials datasets for efficient machine learning with less data

Kamal Choudhary; Brian DeCost; Kangming Li; Daniel Persaud; Jason Hattrick-Simpers; Michael Greenwood

Published

November 10, 2023

Abstract

Extensive efforts to gather materials data have largely overlooked potential data redundancy. In this study, we present evidence of a significant degree of redundancy across multiple large datasets for various material properties, by revealing that up to 95% of data can be safely removed from machine learning training with little impact on in-distribution prediction performance. The redundant data is related to over-represented material types and does not mitigate the severe performance degradation on out-of-distribution samples. In addition, we show that uncertainty-based active learning algorithms can construct much smaller but equally informative datasets. We discuss the effectiveness of informative data in improving prediction performance and robustness and provide insights into efficient data acquisition and machine learning training. This work challenges the "bigger is better" mentality and calls for attention to the information richness of materials data rather than a narrow emphasis on data volume.
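As a rough illustration of the uncertainty-based active learning idea mentioned in the abstract, the following toy sketch greedily builds a small training subset from a larger pool. It is not the authors' method: this record does not specify their models or acquisition rule, so the bootstrap ensemble of straight-line fits, the function names (`fit_line`, `ensemble_uncertainty`, `active_learning_subset`, `make_toy_pool`), and the synthetic pool are all assumptions made for illustration.

```python
import math
import random
import statistics


def fit_line(points):
    """Least-squares fit y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0.0:  # degenerate resample: fall back to a constant model
        return 0.0, mean_y
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def ensemble_uncertainty(train, x, rng, n_models=20):
    """Spread of predictions at x across models fit on bootstrap resamples."""
    preds = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]
        a, b = fit_line(sample)
        preds.append(a * x + b)
    return statistics.pstdev(preds)


def active_learning_subset(pool, n_initial=5, budget=20, seed=0):
    """Greedily add the pool point the current ensemble is least sure about."""
    rng = random.Random(seed)
    remaining = list(pool)
    rng.shuffle(remaining)
    n_initial = min(n_initial, len(remaining))
    train = [remaining.pop() for _ in range(n_initial)]  # random seed set
    while len(train) < budget and remaining:
        scores = [ensemble_uncertainty(train, x, rng) for x, _ in remaining]
        best = max(range(len(remaining)), key=scores.__getitem__)
        train.append(remaining.pop(best))
    return train


def make_toy_pool(seed=1):
    """Hypothetical pool: one over-represented region plus a few diverse points."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 0.1) for _ in range(190)]     # redundant cluster
    xs += [rng.uniform(-3.0, 3.0) for _ in range(10)]  # sparse, diverse region
    return [(x, math.sin(x) + rng.gauss(0.0, 0.05)) for x in xs]


pool = make_toy_pool()
subset = active_learning_subset(pool, n_initial=5, budget=20)
print(f"kept {len(subset)} of {len(pool)} pool points")
```

In this sketch, the labels of all pool points are already known, so the loop acts as a data-pruning procedure rather than as a guide to new measurements; the same selection rule could in principle be applied before labels are acquired.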

Citation

npj Computational Materials

Pub Type

Journals

Download Paper

https://doi.org/10.1038/s41467-023-42992-y

Materials and Chemistry

Citation

Choudhary, K., DeCost, B., Li, K., Persaud, D., Hattrick-Simpers, J. and Greenwood, M. (2023), Exploiting redundancy in large materials datasets for efficient machine learning with less data, npj Computational Materials, [online], https://doi.org/10.1038/s41467-023-42992-y, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=936676 (Accessed July 15, 2026)

Created November 10, 2023, Updated May 17, 2024
