
Even very frequent function words do not distribute homogeneously

De Roeck, Anne; Sarkar, Avik and Garthwaite, Paul (2007). Even very frequent function words do not distribute homogeneously. In: Nicolov, Nicolas; Bontcheva, Kalina; Angelova, Galia and Mitkov, Ruslan eds. Recent Advances in Natural Language Processing IV. Current Issues in Linguistic Theory (292). Amsterdam: John Benjamins, pp. 267–276.

URL: http://www.benjamins.com/cgi-bin/t_bookview.cgi?bo...

Abstract

We have known for some time that content words have "bursty" distributions in text (e.g. Church 00). In contrast, much of the literature assumes that function words are uninformative because they distribute homogeneously (e.g. Katz 96). In this paper, based on two sets of experiments, we show that assumptions of homogeneity do not hold, even for the distribution of extremely frequent function words. In the first set of experiments, we investigate the behaviour of very frequent function words in the TIPSTER collection by postulating a "homogeneity assumption", which we then defeat in a series of experiments based on the χ² test. Results show that it is statistically unreasonable to assume homogeneous term distributions within a corpus. We also found that document collections are not neutral with respect to the property of homogeneity, even for very frequent function words. In the second set of experiments, we model the gaps between successive occurrences of a particular term using a mixture of exponential distributions. Under the "homogeneity assumption", these gaps should be uniformly distributed across the entire corpus. Using the model, however, we demonstrate that gaps are not uniformly distributed, and that even very frequent terms do occur in bursts. Since the homogeneity assumption was defeated resoundingly for diverse collections, we propose that these homogeneity measures and the re-occurrence model are suitable candidates for corpus profiling.
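
As a rough illustration of the kind of χ² homogeneity check the abstract describes, the sketch below compares how often one term occurs in two halves of a toy corpus. It is not the authors' procedure: the abstract does not give their partitioning or test details, so the 2x2 table layout, the function names (`chi_square_2x2`, `term_homogeneity`), and the example sentences are all assumptions made for illustration.

```python
def chi_square_2x2(term_a, total_a, term_b, total_b):
    """Pearson chi-square statistic for a 2x2 table of
    (term count, other-token count) in two text samples."""
    observed = [
        [term_a, total_a - term_a],
        [term_b, total_b - term_b],
    ]
    grand = total_a + total_b
    row_sums = [total_a, total_b]
    col_sums = [term_a + term_b, grand - term_a - term_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the term were spread evenly over both samples.
            expected = row_sums[i] * col_sums[j] / grand
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def term_homogeneity(tokens_a, tokens_b, term):
    """Chi-square statistic for one term across two token lists."""
    return chi_square_2x2(
        tokens_a.count(term), len(tokens_a),
        tokens_b.count(term), len(tokens_b),
    )


# Critical value of chi-square with 1 degree of freedom at the 5% level.
CRITICAL_1DF_05 = 3.841

# Hypothetical toy data, not drawn from TIPSTER.
half_a = "the cat sat on the mat and the dog sat on the rug".split()
half_b = "a cat and a dog ran over a hill to a farm".split()

stat = term_homogeneity(half_a, half_b, "the")
print(f"chi-square for 'the': {stat:.3f}")
print("homogeneity rejected" if stat > CRITICAL_1DF_05 else "homogeneity not rejected")
```

With samples this small the expected counts are too low for the χ² approximation to be trustworthy; the example only shows the mechanics. A realistic check would use large corpus partitions and many terms.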

Item Type: Book Chapter
Copyright Holders: 2007 John Benjamins B.V.
ISBN: 90-272-4807-9, 978-90-272-4807-7
Keywords: computational linguistics
Academic Unit/Department: Mathematics, Computing and Technology
Mathematics, Computing and Technology > Computing & Communications
Mathematics, Computing and Technology > Mathematics and Statistics
Interdisciplinary Research Centre: Centre for Research in Computing (CRC)
Item ID: 22566
Depositing User: Sarah Frain
Date Deposited: 07 Sep 2010 09:41
Last Modified: 02 Dec 2010 21:01
URI: http://oro.open.ac.uk/id/eprint/22566