De Roeck, Anne; Sarkar, Avik and Garthwaite, Paul
(2007).
URL:  http://www.benjamins.com/cgibin/t_bookview.cgi?bo... 

Abstract
We have known for some time that content words have "bursty" distributions in text (e.g. Church 00). In contrast, much of the literature assumes that function words are uninformative because they distribute homogeneously (e.g. Katz 96). In this paper, based on two sets of experiments, we show that assumptions of homogeneity do not hold, even for the distribution of extremely frequent function words. In the first set of experiments, we investigate the behaviour of very frequent function words in the TIPSTER collection by postulating a "homogeneity assumption", which we then defeat in a series of experiments based on the χ² test. Results show that it is statistically unreasonable to assume homogeneous term distributions within a corpus. We also found that document collections are not neutral with respect to the property of homogeneity, even for very frequent function words. In the second set of experiments, we model the gaps between successive occurrences of a particular term using a mixture of exponential distributions. Under the "homogeneity assumption", these gaps should be uniformly distributed across the entire corpus. Using the model, however, we demonstrate that gaps are not uniformly distributed, and that even very frequent terms occur in bursts. Since the homogeneity assumption was defeated resoundingly for diverse collections, we propose that these homogeneity measures and the reoccurrence model are suitable candidates for corpus profiling.
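As a rough illustration of the two ideas in the abstract, the following Python sketch computes a Pearson χ² statistic for how evenly a term is spread over consecutive chunks of a token sequence, and lists the gaps between successive occurrences of that term. It is a minimal sketch under stated assumptions, not the chapter's actual procedure: the equal-sized chunking scheme and the function names `chi_square_homogeneity` and `term_gaps` are hypothetical, and the mixture-of-exponentials fit to the gaps is not attempted here.

```python
def chi_square_homogeneity(tokens, term, n_chunks):
    """Pearson chi-square statistic for a 2 x n_chunks contingency table.

    Rows are (term, any other token); columns are consecutive,
    equal-sized chunks of the token sequence (an assumed scheme).
    Returns the statistic and its degrees of freedom.
    """
    size = len(tokens) // n_chunks
    chunks = [tokens[i * size:(i + 1) * size] for i in range(n_chunks)]
    term_counts = [chunk.count(term) for chunk in chunks]
    other_counts = [size - c for c in term_counts]
    total = size * n_chunks
    total_term = sum(term_counts)
    total_other = total - total_term

    stat = 0.0
    for observed_row, row_total in ((term_counts, total_term),
                                    (other_counts, total_other)):
        # Under homogeneity every chunk expects the same share of the row.
        expected = row_total * size / total
        if expected > 0:
            for observed in observed_row:
                stat += (observed - expected) ** 2 / expected
    return stat, n_chunks - 1


def term_gaps(tokens, term):
    """Distances (in tokens) between successive occurrences of term."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy data (not from TIPSTER): the same term frequency, spread two ways.
even_text = ["the", "cat"] * 40              # term evenly spread
bursty_text = ["the"] * 40 + ["cat"] * 40    # term concentrated up front

even_stat, dof = chi_square_homogeneity(even_text, "the", 4)
bursty_stat, _ = chi_square_homogeneity(bursty_text, "the", 4)
```

On the toy data the evenly spread term gives a statistic of zero, while the concentrated one gives a value far above 7.815, the 5% critical value for 3 degrees of freedom, so homogeneity would be rejected for it.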
Item Type:  Book Chapter 

Copyright Holders:  2007 John Benjamins B.V. 
ISBN:  9027248079, 9789027248077 
Keywords:  computational linguistics 
Academic Unit/Department:  Other Departments > Vice-Chancellor's Office; Mathematics, Computing and Technology > Computing & Communications; Mathematics, Computing and Technology > Mathematics and Statistics 
Interdisciplinary Research Centre:  Centre for Research in Computing (CRC) 
Item ID:  22566 
Depositing User:  Sarah Frain 
Date Deposited:  07 Sep 2010 09:41 
Last Modified:  15 Jan 2016 14:03 
URI:  http://oro.open.ac.uk/id/eprint/22566 