Taha, Abdel Aziz; Papariello, Luca; Bampoulidis, Alexandros; Knoth, Petr and Lupu, Mihai
(2022).
DOI: https://doi.org/10.1109/TKDE.2021.3068009
Abstract
Machine learning research, particularly in genomics, is often based on wide shaped datasets, i.e. datasets having a large number of features, but a small number of samples. Such configurations raise the possibility of chance influence (the increase of measured accuracy due to chance correlations) on the learning process and the evaluation results. Prior research underlined the problem of generalization of models obtained based on such data. In this paper, we investigate the influence of chance on prediction and show its significant effects on wide shaped datasets. First, we empirically demonstrate how significant the influence of chance in such datasets is by showing that prediction models trained on thousands of randomly generated datasets can achieve high accuracy. This is the case even when using cross-validation. We then provide a formal analysis of chance influence and design formal chance influence estimators based on the dataset parameters, namely its sample size, the number of features, the number of classes and the class distribution. Finally, we provide an in-depth discussion of the formal analysis including applications of the findings and recommendations on chance influence mitigation.
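The chance effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' estimator or experimental setup; it is a minimal, assumed toy setting in which both features and labels are independent fair coin flips, so no feature carries real signal. The function name `best_single_feature_accuracy` and its parameters are hypothetical and chosen for illustration only. It reports how well the single best-matching feature "predicts" the labels as the number of features grows while the sample size stays small.

```python
import random


def best_single_feature_accuracy(n_samples, n_features, seed=0):
    """Return the highest agreement between any one random binary feature
    and a random binary label vector (illustrative toy setting only)."""
    rng = random.Random(seed)
    # Labels are pure noise: independent fair coin flips.
    labels = [rng.randint(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is also pure noise, independent of the labels.
        feature = [rng.randint(0, 1) for _ in range(n_samples)]
        matches = sum(f == y for f, y in zip(feature, labels))
        # A feature that mostly disagrees is equally "predictive" once inverted.
        accuracy = max(matches, n_samples - matches) / n_samples
        best = max(best, accuracy)
    return best


if __name__ == "__main__":
    # Fixed small sample size, increasing width.
    for n_features in (1, 10, 100, 1000, 10000):
        print(n_features, best_single_feature_accuracy(20, n_features))
```

With a fixed seed, the first features generated are the same for every width, so the reported value can only stay level or rise as features are added. This measures agreement on the same samples used for selection, so it shows chance correlation in its simplest form rather than the cross-validated results the paper reports.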
About
- Item ORO ID
- 75770
- Item Type
- Journal Item
- ISSN
- 1558-2191
- Project Funding Details
- Funded Project Name: Not Set; Project ID: 10.13039/501100004955; Funding Body: Österreichische Forschungsförderungsgesellschaft
  Funded Project Name: Not Set; Project ID: 10.13039/100010661; Funding Body: Horizon 2020 Framework Programme
- Keywords
- High-dimensional data; chance correlation; formal estimation of chance; generalization; sparse data; genomics
- Academic Unit or School
- Faculty of Science, Technology, Engineering and Mathematics (STEM) > Knowledge Media Institute (KMi)
  Faculty of Science, Technology, Engineering and Mathematics (STEM)
- Research Group
- Big Scientific Data and Text Analytics Group (BSDTAG)
- Copyright Holders
- © 2021 IEEE
- Depositing User
- Petr Knoth