Data Frequency Coverage Impact on AI Performance

Erin Lanus; Brian Lee; Jaganmohan Chandrasekaran; Laura Freeman; M S Raunak; Raghu Kacker; David Kuhn

Published

April 15, 2025

Abstract

Artificial Intelligence (AI) models use statistical learning over data to solve complex problems for which straightforward rules or algorithms may be difficult or impossible to design; however, a side effect is that models that are complex enough to sufficiently represent the function may be uninterpretable. Combinatorial testing, a black-box approach arising from software testing, has been applied to test AI models. A key differentiator between traditional software and AI is that many traditional software faults are deterministic, requiring a failure-inducing combination of inputs to appear only once in the test set for it to be discovered. On the other hand, AI models learn statistically by reinforcing weights through repeated appearances in the training dataset, and the frequency of input combinations plays a significant role in influencing the model's behavior. Thus, a single occurrence of a combination of feature values may not be sufficient to influence the model's behavior. Consequently, measures like simple combinatorial coverage that are applicable to software testing do not capture the frequency with which interactions are covered in the AI model's input space. This work develops methods to characterize the data frequency coverage of feature interactions in training datasets and analyze the impact of imbalance, or skew, in the combinatorial frequency coverage of the training data on model performance. We demonstrate our methods with experiments on an open-source dataset using several classical machine learning algorithms. This pilot study makes three observations: performance may increase or decrease with data skew, feature importance methods do not predict skew impact, and adding more data may not mitigate skew effects.
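
The distinction the abstract draws between simple combinatorial coverage and frequency coverage can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`interaction_counts`, `simple_coverage`, `frequency_spread`), the toy features, and the choice of a min/max count as a crude skew indicator are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations, product


def interaction_counts(rows, domains, t=2):
    """Count how often each t-way combination of feature values appears.

    Hypothetical helper: rows is a list of dicts, domains maps each
    feature name to its list of possible values.
    """
    counts = {}
    for feats in combinations(sorted(domains), t):
        # Start every possible value combination at zero so that
        # combinations absent from the data remain visible.
        c = Counter({vals: 0 for vals in product(*(domains[f] for f in feats))})
        c.update(tuple(row[f] for f in feats) for row in rows)
        counts[feats] = c
    return counts


def simple_coverage(counts):
    """Fraction of possible t-way combinations that appear at least once."""
    total = sum(len(c) for c in counts.values())
    covered = sum(1 for c in counts.values() for n in c.values() if n > 0)
    return covered / total


def frequency_spread(counts):
    """Smallest and largest occurrence count over all combinations
    (an illustrative skew indicator, not a measure from the paper)."""
    all_counts = [n for c in counts.values() for n in c.values()]
    return min(all_counts), max(all_counts)


# Toy dataset (assumed for illustration): every 2-way combination
# appears, but one of them dominates.
domains = {"color": ["red", "blue"], "size": ["S", "L"]}
rows = (
    [{"color": "red", "size": "S"}] * 7
    + [{"color": "red", "size": "L"}]
    + [{"color": "blue", "size": "S"}]
    + [{"color": "blue", "size": "L"}]
)

counts = interaction_counts(rows, domains, t=2)
coverage = simple_coverage(counts)
spread = frequency_spread(counts)
print(coverage, spread)
```

In this toy dataset every pair of values is covered, so simple coverage reports full coverage, while the per-combination counts show that one combination appears seven times as often as each of the others. That gap is the kind of imbalance a presence-only coverage measure cannot express.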

Proceedings Title

Intl Workshop on Combinatorial Testing

Conference Dates

March 31-April 4, 2025

Conference Location

Naples, IT

Pub Type

Conferences

Download Paper

https://doi.org/10.1109/ICSTW64639.2025.10962464

Keywords

testing AI, combinatorial coverage, combinatorial frequency

Information technology

Citation

Lanus, E., Lee, B., Chandrasekaran, J., Freeman, L., Raunak, M., Kacker, R. and Kuhn, D. (2025), Data Frequency Coverage Impact on AI Performance, Intl Workshop on Combinatorial Testing, Naples, IT, [online], https://doi.org/10.1109/ICSTW64639.2025.10962464, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=959549 (Accessed July 17, 2025)

Created April 15, 2025, Updated June 3, 2025
