De Roeck, Anne; Sarkar, Avik and Garthwaite, Paul H. (2004).
URL: http://www.cavi.univ-paris3.fr/lexicometrica/jadt/...
Abstract
The statistical NLP and IR literatures tend to make a “homogeneity assumption” about the distribution of terms, either by adopting a “bag of words” model, or in their treatment of function words. In this paper we develop a notion of homogeneity detection to a level of statistical significance, and conduct a series of experiments on different datasets, to show that the homogeneity assumption does not generally hold. We show that it also does not hold for function words. Importantly, datasets and document collections are found not to be neutral with respect to the property of homogeneity, even for function words. The homogeneity assumption is defeated substantially even for collections known to contain similar documents, and more drastically for diverse collections. We conclude that it is statistically unreasonable to assume that word distribution within a corpus is homogeneous. Because homogeneity findings differ substantially between different collections, we argue for the use of homogeneity measures as a means of profiling datasets.
About
- Item ORO ID: 22604
- Item Type: Conference or Workshop Item
- Keywords: homogeneity; term distribution; corpus profiling
- Academic Unit or School:
  - Faculty of Science, Technology, Engineering and Mathematics (STEM) > Computing and Communications
  - Faculty of Science, Technology, Engineering and Mathematics (STEM)
  - Faculty of Science, Technology, Engineering and Mathematics (STEM) > Mathematics and Statistics
- Research Group: Centre for Research in Computing (CRC)
- Copyright Holders: © 2004 The Authors
- Depositing User: Sarah Frain