Data Frequency Coverage Impact on AI Performance

Published

Author(s)

Erin Lanus, Brian Lee, Jaganmohan Chandrasekaran, Laura Freeman, M S Raunak, Raghu Kacker, David Kuhn

Abstract

Artificial Intelligence (AI) models use statistical learning over data to solve complex problems for which straightforward rules or algorithms may be difficult or impossible to design; however, a side effect is that models that are complex enough to sufficiently represent the function may be uninterpretable. Combinatorial testing, a black-box approach arising from software testing, has been applied to test AI models. A key differentiator between traditional software and AI is that many traditional software faults are deterministic, requiring a failure-inducing combination of inputs to appear only once in the test set for it to be discovered. On the other hand, AI models learn statistically by reinforcing weights through repeated appearances in the training dataset, and the frequency of input combinations plays a significant role in influencing the model's behavior. Thus, a single occurrence of a combination of feature values may not be sufficient to influence the model's behavior. Consequently, measures like simple combinatorial coverage that are applicable to software testing do not capture the frequency with which interactions are covered in the AI model's input space. This work develops methods to characterize the data frequency coverage of feature interactions in training datasets and analyze the impact of imbalance, or skew, in the combinatorial frequency coverage of the training data on model performance. We demonstrate our methods with experiments on an open-source dataset using several classical machine learning algorithms. This pilot study makes three observations: performance may increase or decrease with data skew, feature importance methods do not predict skew impact, and adding more data may not mitigate skew effects.
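
The distinction the abstract draws between simple combinatorial coverage and the frequency with which interactions appear can be illustrated with a small sketch. This is not the authors' method: the function names (`interaction_counts`, `simple_coverage`), the toy dataset, and the choice of 2-way interactions are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations
from math import prod


def interaction_counts(rows, t=2):
    """Count how often each t-way combination of feature values appears.

    Hypothetical helper: keys are (column_indices, values) pairs.
    """
    counts = Counter()
    n_features = len(rows[0])
    for row in rows:
        for cols in combinations(range(n_features), t):
            counts[(cols, tuple(row[c] for c in cols))] += 1
    return counts


def simple_coverage(rows, domains, t=2):
    """Fraction of all possible t-way combinations that appear at least once."""
    counts = interaction_counts(rows, t)
    total = sum(
        prod(len(domains[c]) for c in cols)
        for cols in combinations(range(len(domains)), t)
    )
    return len(counts) / total


# Toy data (assumed): two features, each with two possible values.
domains = [("a", "b"), ("x", "y")]
rows = [("a", "x")] * 7 + [("a", "y"), ("b", "x"), ("b", "y")]

counts = interaction_counts(rows)
coverage = simple_coverage(rows, domains)

print("simple 2-way coverage:", coverage)
for (cols, values), n in sorted(counts.items()):
    print(cols, values, "appears", n, "times")
```

In this toy dataset every 2-way combination appears at least once, so simple coverage is 1.0, yet the combination `("a", "x")` accounts for 7 of the 10 rows. A presence-only coverage measure cannot distinguish this skewed dataset from a balanced one, which is the gap that a frequency-based measure is meant to address.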
Proceedings Title
Intl Workshop on Combinatorial Testing
Conference Dates
March 31-April 4, 2025
Conference Location
Naples, IT

Keywords

testing AI, combinatorial coverage, combinatorial frequency

Citation

Lanus, E., Lee, B., Chandrasekaran, J., Freeman, L., Raunak, M., Kacker, R. and Kuhn, D. (2025), Data Frequency Coverage Impact on AI Performance, Intl Workshop on Combinatorial Testing, Naples, IT, [online], https://doi.org/10.1109/ICSTW64639.2025.10962464, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=959549 (Accessed June 5, 2025)

Created April 15, 2025, Updated June 3, 2025