Standard Errors and Significance Testing in Data Analysis for Testing Classifiers
Jin Chu Wu, Raghu N. Kacker
One-classifier and two-classifier significance tests are conducted to evaluate and compare classifiers. The tests investigate the statistical significance of differences and quantify it in terms of the significance level, i.e., the p-value, in a new ROC analysis that employs three score distributions and two decision thresholds and involves data dependency caused by multiple use of the same subjects. To analyze the performance of classifiers, the standard error of the cost function is estimated using the nonparametric three-sample two-layer bootstrap algorithm on a two-layer data structure constructed after dataset optimization, based on our prior rigorous statistical research in ROC analysis on large datasets with data dependency. When two classifiers are compared, the positive correlation coefficient, which is computed using a synchronized resampling algorithm, must be taken into account; otherwise, the likelihood of detecting a statistically significant difference between the performance levels of the two classifiers can be wrongly reduced.
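The effect of the correlation coefficient on the two-classifier test can be illustrated with a minimal sketch. The abstract does not state the exact test statistic, so the code below assumes the standard two-sided z-test for the difference of two correlated estimates, with the variance of the difference taken as SE_A^2 + SE_B^2 - 2 r SE_A SE_B. The function name `two_classifier_z_test`, its parameters, and all numerical values are hypothetical and are not taken from the report.

```python
import math


def two_classifier_z_test(cost_a, se_a, cost_b, se_b, corr=0.0):
    """Two-sided z-test for the difference between two estimated costs.

    Assumption: a normal reference distribution and the textbook variance
    formula for the difference of two correlated estimates. This is an
    illustration, not necessarily the procedure used in the report.
    """
    # Variance of the difference; a positive correlation shrinks it.
    var_diff = se_a ** 2 + se_b ** 2 - 2.0 * corr * se_a * se_b
    if var_diff <= 0.0:
        raise ValueError("variance of the difference must be positive")
    z = (cost_a - cost_b) / math.sqrt(var_diff)
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical costs and standard errors, for illustration only.
z_ind, p_ind = two_classifier_z_test(0.050, 0.004, 0.040, 0.004, corr=0.0)
z_cor, p_cor = two_classifier_z_test(0.050, 0.004, 0.040, 0.004, corr=0.6)

print(f"correlation ignored:  z = {z_ind:.3f}, p = {p_ind:.4f}")
print(f"correlation included: z = {z_cor:.3f}, p = {p_cor:.4f}")
```

With the same costs and standard errors, including a positive correlation yields a larger z-score and a smaller p-value than ignoring it, which is the sense in which omitting the correlation can wrongly reduce the chance of detecting a significant difference.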
Wu, J. and Kacker, R.
Standard Errors and Significance Testing in Data Analysis for Testing Classifiers, NIST Interagency/Internal Report (NISTIR), National Institute of Standards and Technology, Gaithersburg, MD, [online], https://doi.org/10.6028/NIST.IR.8383, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=932649
(Accessed December 8, 2023)