Berrar, Daniel (2022).
DOI: https://doi.org/10.1007/s10618-022-00828-1
Abstract
The statistical comparison of machine learning classifiers is frequently underpinned by null hypothesis significance testing. Here, we provide a survey and analysis of underrated problems that significance testing entails for classification benchmark studies. The p-value has become deeply entrenched in machine learning, but it is substantially less objective and less informative than commonly assumed. Even very small p-values can drastically overstate the evidence against the null hypothesis. Moreover, the p-value depends on the experimenter's intentions, irrespective of whether these were actually realized. We show how such intentions can lead to experimental designs with more than one stage, and how to calculate a valid p-value for such designs. We discuss two widely used statistical tests for the comparison of classifiers, the Friedman test and the Wilcoxon signed-rank test. We also discuss improvements to the use of p-values, such as calibration with the Bayes factor bound, as well as alternative methods for the evaluation of benchmark studies.
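To make the techniques named in the abstract concrete, the following is a minimal Python sketch, not taken from the paper: it compares hypothetical per-dataset accuracies of three classifiers with SciPy's Friedman and Wilcoxon signed-rank tests, and calibrates a p-value with the Bayes factor bound 1 / (-e · p · ln p), valid for p < 1/e. The accuracy values and names (clf_a, clf_b, clf_c) are illustrative assumptions, not results from the paper.

```python
# Illustrative sketch (not the paper's experiments): Friedman and Wilcoxon
# signed-rank tests on made-up classifier accuracies, plus the Bayes factor
# bound calibration of a p-value.
import math
from scipy import stats

# Hypothetical accuracies of three classifiers on ten benchmark datasets.
clf_a = [0.81, 0.77, 0.90, 0.69, 0.84, 0.73, 0.88, 0.79, 0.92, 0.75]
clf_b = [0.79, 0.74, 0.91, 0.66, 0.80, 0.70, 0.86, 0.78, 0.90, 0.72]
clf_c = [0.83, 0.76, 0.89, 0.70, 0.85, 0.74, 0.87, 0.80, 0.93, 0.77]

# Friedman test: omnibus comparison of all three classifiers across datasets.
friedman_stat, friedman_p = stats.friedmanchisquare(clf_a, clf_b, clf_c)

# Wilcoxon signed-rank test: pairwise comparison of two classifiers.
wilcoxon_stat, wilcoxon_p = stats.wilcoxon(clf_a, clf_b)

def bayes_factor_bound(p: float) -> float:
    """Upper bound on the Bayes factor against the null, 1 / (-e * p * ln p),
    valid for 0 < p < 1/e (Sellke-Bayarri-Berger calibration)."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("Calibration requires 0 < p < 1/e.")
    return 1.0 / (-math.e * p * math.log(p))

print(f"Friedman:  chi2 = {friedman_stat:.3f}, p = {friedman_p:.4f}")
print(f"Wilcoxon:  W = {wilcoxon_stat:.1f}, p = {wilcoxon_p:.4f}")
# For p = 0.01, the bound is roughly 8, i.e. at most ~8:1 evidence against H0.
print(f"Bayes factor bound for p = 0.01: {bayes_factor_bound(0.01):.1f}")
```

The last line illustrates the abstract's point that small p-values overstate the evidence: a p-value of 0.01 corresponds to at most about 8:1 odds against the null hypothesis under this calibration, far weaker than "1 in 100" intuition suggests.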