Effectiveness of dataset reduction in testing machine learning algorithms

Raghu N. Kacker; David R. Kuhn

Published

August 25, 2020

Author(s)

Raghu N. Kacker, David R. Kuhn

Abstract

Many machine learning algorithms examine large amounts of data to discover insights from hidden patterns. Testing these algorithms can be expensive and time-consuming. There is a need to speed up the testing process, especially in agile development, where testing is performed frequently. One approach is to replace big datasets with smaller datasets produced by random sampling. In this paper, we report a set of experiments designed to evaluate the effectiveness of using reduced datasets produced by random sampling for testing machine learning algorithms. In our experiments, we use as subject programs four supervised learning algorithms from the Waikato Environment for Knowledge Analysis (WEKA). We identify five datasets from Kaggle.com to run with the four learning algorithms. For each dataset, we generate reduced datasets of different sizes using two random sampling strategies: pure random sampling and stratified random sampling. We execute our subject programs with the original and the reduced datasets, and measure test effectiveness using branch coverage and mutation coverage. Our results indicate that in most cases, reduced datasets of even very small sizes can achieve the same or similar coverage as the original dataset. Furthermore, our results indicate that reduced datasets produced by the two sampling strategies do not differ significantly, and that branch coverage correlates with mutation coverage.
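The two sampling strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (the paper's subject programs come from WEKA, and the abstract does not describe the sampling code). The function names, the `fraction` and `seed` parameters, and the toy dataset are all assumptions made for illustration. Pure random sampling draws rows uniformly from the whole dataset, while stratified random sampling draws the same fraction from each class so that class proportions are roughly preserved.

```python
import random
from collections import Counter, defaultdict


def pure_random_sample(rows, fraction, seed=0):
    """Draw about `fraction` of the rows uniformly, without replacement."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(rows)))
    return rng.sample(rows, k)


def stratified_random_sample(rows, label_of, fraction, seed=0):
    """Draw about `fraction` of the rows from each class separately.

    `label_of` is a hypothetical callback that returns the class label of a row.
    Every class keeps at least one row, so small classes are never dropped.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    sample = []
    for members in groups.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Toy dataset (illustrative only): 90 rows of class "a", 10 rows of class "b".
dataset = [(i, "a") for i in range(90)] + [(i, "b") for i in range(90, 100)]

pure = pure_random_sample(dataset, fraction=0.1)
stratified = stratified_random_sample(dataset, lambda row: row[1], fraction=0.1)

# Class counts in the stratified sample follow the original 9:1 ratio.
stratified_counts = Counter(label for _, label in stratified)
```

With pure random sampling, a rare class can be missing from a very small sample by chance. The stratified variant guarantees each class is represented, which is one reason the two strategies might be expected to differ, although the abstract reports no significant difference in coverage between them.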

Conference Dates

April 13-16, 2020

Conference Location

OXFORD

Conference Title

IEEE International Conference on Artificial Intelligence Testing

Pub Type

Conferences

Download Paper

https://doi.org/10.1109/AITEST49225.2020.00027

Keywords

Testing classifiers, Random sampling, Reduced datasets, Testing machine learning, Branch coverage, Software testing.

Software testing, Cyber-physical systems and Artificial intelligence

Citation

Kacker, R. and Kuhn, D. (2020), Effectiveness of dataset reduction in testing machine learning algorithms, IEEE International Conference on Artificial Intelligence Testing, OXFORD, [online], https://doi.org/10.1109/AITEST49225.2020.00027 (Accessed July 7, 2025)

Issues

If you have any questions about this publication or are having problems accessing it, please contact [email protected].

Created August 24, 2020, Updated September 18, 2020
