Berrar, Daniel and Dubitzky, Werner
(2018).
DOI: https://doi.org/10.1109/DSAA.2017.3
Abstract
Null hypothesis significance testing has become a mainstay in machine learning, with the p-value firmly embedded in current research practice. Significance testing is widely believed to lend scientific rigor to the interpretation of empirical findings; however, its serious problems have received scant attention in the machine learning literature so far. Here, we investigate one particular problem: the Jeffreys-Lindley paradox. This paradox describes a statistical conundrum in which the frequentist and Bayesian interpretations of the same data are diametrically opposed. In four experiments using synthetic data sets and a subsequent thought experiment, we demonstrate that this paradox has severe, real consequences for current research practice. We caution that this practice might lead to a situation similar to the current reproducibility crisis in other fields of science. We offer for debate four avenues that might avert the looming crisis.
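
The paradox named in the abstract can be illustrated with a minimal sketch. This is a textbook setting, not a reproduction of the paper's four experiments, and every modeling choice here is an assumption made for illustration: a normal mean with known standard deviation `sigma`, a point null hypothesis H0: theta = 0, an alternative H1 with a N(0, tau^2) prior on theta, and equal prior probabilities for the two hypotheses. The function names are hypothetical. The z statistic is held at 1.96, so the two-sided p-value stays just below 0.05 for every sample size, while the Bayes factor in favor of H0 is computed for increasing n.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value of a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bayes_factor_01(z, n, sigma=1.0, tau=1.0):
    """Bayes factor of H0 (theta = 0) against H1 (theta ~ N(0, tau^2)).

    Assumes the sample mean of n observations is N(theta, sigma^2 / n).
    Values above 1 favor H0. sigma and tau are illustrative choices.
    """
    r = n * tau**2 / sigma**2
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))


def posterior_prob_h0(bf01, prior_h0=0.5):
    """Posterior probability of H0 from the Bayes factor and a prior."""
    odds = bf01 * prior_h0 / (1.0 - prior_h0)
    return odds / (1.0 + odds)


Z = 1.96  # held fixed, so the frequentist verdict never changes
for n in (10, 1_000, 100_000, 10_000_000):
    bf = bayes_factor_01(Z, n)
    print(
        f"n={n:>10}  p={two_sided_p_value(Z):.4f}  "
        f"BF01={bf:10.2f}  P(H0|data)={posterior_prob_h0(bf):.4f}"
    )
```

Under these assumptions, the significance test rejects H0 at the 5% level for every n, whereas the Bayes factor increasingly favors H0 as n grows. The strength of the effect depends on the prior scale `tau`, which is a free choice in this sketch.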