Using p-values for the comparison of classifiers: pitfalls and alternatives

Berrar, Daniel (2022). Using p-values for the comparison of classifiers: pitfalls and alternatives. Data Mining and Knowledge Discovery, 36(3) pp. 1102–1139.



The statistical comparison of machine learning classifiers is frequently underpinned by null hypothesis significance testing. Here, we provide a survey and analysis of underrated problems that significance testing entails for classification benchmark studies. The p-value has become deeply entrenched in machine learning, but it is substantially less objective and less informative than commonly assumed. Even very small p-values can drastically overstate the evidence against the null hypothesis. Moreover, the p-value depends on the experimenter’s intentions, irrespective of whether these were actually realized. We show how such intentions can lead to experimental designs with more than one stage, and how to calculate a valid p-value for such designs. We discuss two widely used statistical tests for the comparison of classifiers, the Friedman test and the Wilcoxon signed rank test. Some improvements to the use of p-values, such as the calibration with the Bayes factor bound, and alternative methods for the evaluation of benchmark studies are discussed as well.
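The calibration with the Bayes factor bound mentioned above can be sketched briefly. The bound, due to Sellke, Bayarri, and Berger (2001), states that for a p-value p < 1/e, the Bayes factor in favor of the alternative hypothesis is at most 1/(−e·p·ln p). The function below is an illustrative sketch (the name `bayes_factor_bound` is ours, not the paper's); it shows why even "significant" p-values convey weaker evidence than commonly assumed:

```python
import math

def bayes_factor_bound(p: float) -> float:
    """Upper bound on the evidence against H0 implied by a p-value p,
    following Sellke, Bayarri, and Berger (2001): 1 / (-e * p * ln p)."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))

# A p-value of 0.05 corresponds to at most about 2.5:1 odds against
# the null hypothesis -- far weaker evidence than the conventional
# "significance" label suggests.
for p in (0.05, 0.01, 0.005):
    print(f"p = {p:<6} Bayes factor bound = {bayes_factor_bound(p):.2f}")
```

For example, p = 0.05 yields a bound of roughly 2.5, meaning the data favor the alternative over the null by at most 2.5 to 1, regardless of the prior placed on the alternative.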
