Effectiveness of dataset reduction in testing machine learning algorithms

Published

Author(s)

Raghu N. Kacker, David R. Kuhn

Abstract

Many machine learning algorithms examine large amounts of data to discover insights from hidden patterns. Testing these algorithms can be expensive and time-consuming. There is a need to speed up the testing process, especially in an agile development process, where testing is performed frequently. One approach is to replace big datasets with smaller datasets produced by random sampling. In this paper, we report a set of experiments designed to evaluate the effectiveness of using reduced datasets produced by random sampling for testing machine learning algorithms. In our experiments, we use as subject programs four supervised learning algorithms from the Waikato Environment for Knowledge Analysis (WEKA). We identify five datasets from Kaggle.com to run with the four learning algorithms. For each dataset, we generate reduced datasets of different sizes using two random sampling strategies, i.e., pure random and stratified random sampling. We execute our subject programs with the original and the reduced datasets, and measure test effectiveness using branch and mutation coverage. Our results indicate that in most cases, reduced datasets of even very small sizes can achieve the same or similar coverage as the original dataset. Furthermore, our results indicate that reduced datasets produced by the two sampling strategies do not differ significantly, and that branch coverage correlates with mutation coverage.
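The two sampling strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (the subject programs in the paper are WEKA algorithms, and the abstract does not describe the sampling tooling); the function names `pure_random_sample` and `stratified_random_sample`, the `fraction` and `seed` parameters, and the toy dataset are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def pure_random_sample(rows, fraction, seed=0):
    """Draw a simple random sample without replacement from the whole dataset."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


def stratified_random_sample(rows, label_of, fraction, seed=0):
    """Sample each class separately so class proportions are roughly preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        # Keep at least one instance per class so no class disappears.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical toy dataset: 90 instances labelled "no", 10 labelled "yes".
rows = [(i, "no") for i in range(90)] + [(i, "yes") for i in range(90, 100)]

reduced_pure = pure_random_sample(rows, 0.1)
reduced_strat = stratified_random_sample(rows, lambda r: r[1], 0.1)
```

With a 10% fraction, both calls return 10 instances. The stratified sample always contains 9 "no" and 1 "yes" instance, while the class mix of the pure random sample depends on the draw and may omit the minority class entirely.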
Conference Dates
April 13-16, 2020
Conference Location
OXFORD
Conference Title
IEEE International Conference on Artificial Intelligence Testing

Keywords

Testing classifiers, Random sampling, Reduced datasets, Testing machine learning, Branch coverage, Software testing.

Citation

Kacker, R. and Kuhn, D. (2020), Effectiveness of dataset reduction in testing machine learning algorithms, IEEE International Conference on Artificial Intelligence Testing, OXFORD, [online], https://doi.org/10.1109/AITEST49225.2020.00027 (Accessed April 19, 2024)
Created August 24, 2020, Updated September 18, 2020