Author(s)
Jin Chu Wu, Raghu N. Kacker
Abstract
Significance testing for the evaluation of a single classifier and for the comparison of two classifiers is conducted to determine whether observed differences are statistically significant and to quantify them through a significance level, i.e., a p-value. The tests are carried out within a new ROC analysis that employs three score distributions and two decision thresholds and that involves data dependency caused by multiple uses of the same subjects. To evaluate a classifier's performance, the standard error of the cost function is estimated with the nonparametric three-sample two-layer bootstrap algorithm applied to a two-layer data structure constructed after dataset optimization, building on our prior rigorous statistical research in ROC analysis of large datasets with data dependency. When comparing two classifiers, the positive correlation coefficient, computed using a synchronized resampling algorithm, must be taken into account; otherwise, the likelihood of detecting a statistically significant difference between the performance levels of the two classifiers can be wrongly reduced.
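To make the two resampling ideas in the abstract concrete, the sketch below shows a generic two-layer (cluster) bootstrap standard error and a synchronized (paired) bootstrap comparison. This is an illustrative simplification, not the report's exact three-sample algorithm or its cost function: the function names, the use of the sample mean as the statistic, and the paired z statistic are assumptions for demonstration only.

```python
import numpy as np

def two_layer_bootstrap_se(subject_scores, stat, n_boot=1000, seed=0):
    """Illustrative two-layer (cluster) bootstrap standard error.
    Layer 1 resamples subjects with replacement; layer 2 resamples each
    selected subject's scores, so the dependency created by reusing the
    same subjects is preserved in every replication."""
    rng = np.random.default_rng(seed)
    subjects = [np.asarray(s) for s in subject_scores]
    n = len(subjects)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        picked = rng.integers(0, n, size=n)  # layer 1: resample subjects
        pooled = np.concatenate([
            rng.choice(subjects[i], size=len(subjects[i]), replace=True)
            for i in picked                   # layer 2: resample scores within subject
        ])
        reps[b] = stat(pooled)
    return reps.std(ddof=1)

def synchronized_z(scores_a, scores_b, stat, n_boot=1000, seed=0):
    """Illustrative synchronized resampling: identical bootstrap indices
    are drawn for both classifiers, so the correlation rho between their
    replicated statistics can be estimated; a positive rho shrinks the
    variance of the difference and increases the power of the test."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    reps = np.empty((n_boot, 2))
    for k in range(n_boot):
        idx = rng.integers(0, len(a), size=len(a))  # same draw for both classifiers
        reps[k] = stat(a[idx]), stat(b[idx])
    se_a, se_b = reps.std(axis=0, ddof=1)
    rho = np.corrcoef(reps[:, 0], reps[:, 1])[0, 1]
    z = (stat(a) - stat(b)) / np.sqrt(se_a**2 + se_b**2 - 2 * rho * se_a * se_b)
    return z, rho
```

Ignoring a positive rho (i.e., using the unpaired denominator) inflates the standard error of the difference and can mask a real performance gap, which is the failure mode the abstract warns against.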
Citation
NIST Interagency/Internal Report (NISTIR) - 8383
Keywords
ROC analysis, Data dependency, Standard error, Bootstrap, Statistical significance, Significance testing
Citation
Wu, J. and Kacker, R. (2021), Standard Errors and Significance Testing in Data Analysis for Testing Classifiers, NIST Interagency/Internal Report (NISTIR), National Institute of Standards and Technology, Gaithersburg, MD, [online], https://doi.org/10.6028/NIST.IR.8383, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=932649 (Accessed May 9, 2026)