
Defeating the homogeneity assumption

De Roeck, Anne; Sarkar, Avik and Garthwaite, Paul H. (2004). Defeating the homogeneity assumption. In: JADT 2004 : 7es Journées internationales d’Analyse statistique des Données Textuelles, 10-12 March 2004, Louvain-la-Neuve, Belgium, UCL Presses Universitaires, pp. 282–294.

URL: http://www.cavi.univ-paris3.fr/lexicometrica/jadt/...

Abstract

The statistical NLP and IR literatures tend to make a “homogeneity assumption” about the distribution of terms, either by adopting a “bag of words” model, or in their treatment of function words. In this paper we develop a notion of homogeneity detection to a level of statistical significance, and conduct a series of experiments on different datasets, to show that the homogeneity assumption does not generally hold. We show that it also does not hold for function words. Importantly, datasets and document collections are found not to be neutral with respect to the property of homogeneity, even for function words. The homogeneity assumption is defeated substantially even for collections known to contain similar documents, and more drastically for diverse collections. We conclude that it is statistically unreasonable to assume that word distribution within a corpus is homogeneous. Because homogeneity findings differ substantially between different collections, we argue for the use of homogeneity measures as a means of profiling datasets.
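
As a rough illustration of what testing homogeneity "to a level of statistical significance" could look like, the sketch below splits a collection into two random halves and computes a chi-square statistic over the most frequent terms. The abstract does not specify the test used, so the chi-square split-half approach, the function name `chi_square_homogeneity`, the whitespace tokenisation, and the `top_n` and `seed` parameters are all assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import Counter


def chi_square_homogeneity(docs, top_n=10, seed=0):
    """Hypothetical helper: chi-square statistic comparing term counts
    in two random halves of a document collection.

    Returns (statistic, degrees_of_freedom). A large statistic relative
    to the chi-square distribution with that many degrees of freedom
    suggests the two halves differ, i.e. the collection is not homogeneous.
    """
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    halves = [shuffled[:mid], shuffled[mid:]]

    # Term counts per half (assumption: lowercase, whitespace tokenisation).
    counts = [
        Counter(tok for doc in half for tok in doc.lower().split())
        for half in halves
    ]

    # Restrict the comparison to the top_n most frequent terms overall.
    total = counts[0] + counts[1]
    terms = [term for term, _ in total.most_common(top_n)]

    row_sums = [sum(c[t] for t in terms) for c in counts]
    grand = sum(row_sums)
    if grand == 0:
        return 0.0, 0

    # Standard 2 x k contingency-table chi-square.
    stat = 0.0
    for t in terms:
        col = counts[0][t] + counts[1][t]
        for i in range(2):
            expected = row_sums[i] * col / grand
            if expected > 0:
                stat += (counts[i][t] - expected) ** 2 / expected

    return stat, len(terms) - 1


if __name__ == "__main__":
    stat, dof = chi_square_homogeneity(["a a a a", "b b b b"])
    print(stat, dof)
```

To turn the statistic into a significance decision, it would be compared against a chi-square distribution with the returned degrees of freedom. Repeating the split with different seeds gives a sense of how stable the result is for a given collection.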

Item Type: Conference Item
Copyright Holders: 2004 The Authors
Keywords: homogeneity; term distribution; corpus profiling
Academic Unit/Department: Mathematics, Computing and Technology
Mathematics, Computing and Technology > Computing & Communications
Mathematics, Computing and Technology > Mathematics and Statistics
Interdisciplinary Research Centre: Centre for Research in Computing (CRC)
Item ID: 22604
Depositing User: Sarah Frain
Date Deposited: 17 Aug 2010 11:36
Last Modified: 02 Dec 2010 21:01
URI: http://oro.open.ac.uk/id/eprint/22604