De Roeck, Anne; Sarkar, Avik and Garthwaite, Paul
(2007).
URL: http://www.benjamins.com/cgi-bin/t_bookview.cgi?bo...
Abstract
We have known for some time that content words have "bursty" distributions in text (e.g. Church 00). In contrast, much of the literature assumes that function words are uninformative because they distribute homogeneously (e.g. Katz 96). In this paper, based on two sets of experiments, we show that assumptions of homogeneity do not hold, even for the distribution of extremely frequent function words. In the first set of experiments, we investigate the behaviour of very frequent function words in the TIPSTER collection by postulating a "homogeneity assumption", which we then defeat in a series of experiments based on the χ² test. Results show that it is statistically unreasonable to assume homogeneous term distributions within a corpus. We also found that document collections are not neutral with respect to the property of homogeneity, even for very frequent function words. In the second set of experiments, we model the gaps between successive occurrences of a particular term using a mixture of exponential distributions. Under the "homogeneity assumption", these gaps should be uniformly distributed across the entire corpus. Using the model, however, we demonstrate that gaps are not uniformly distributed, and that even very frequent terms do occur in bursts. Since the homogeneity assumption was defeated resoundingly for diverse collections, we propose that these homogeneity measures and the re-occurrence model are suitable candidates for corpus profiling.
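The two kinds of measurement mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `chi_square_homogeneity` and `gaps`, the chunking of a corpus into parts, and the toy counts are all assumptions made for illustration. The first function computes a Pearson χ² statistic comparing a term's frequency across corpus chunks; the second extracts the gaps between successive occurrences of a term, which is the raw data a re-occurrence model would be fitted to.

```python
def chi_square_homogeneity(counts, sizes):
    """Pearson chi-square statistic for a 2 x k table.

    counts[i] is the number of occurrences of the term in chunk i,
    sizes[i] is the total number of tokens in chunk i (both hypothetical inputs).
    Returns (statistic, degrees_of_freedom).
    """
    total_count = sum(counts)
    total_size = sum(sizes)
    if total_count == 0 or total_count == total_size:
        raise ValueError("term must occur in some, but not all, token positions")
    rate = total_count / total_size  # pooled relative frequency under homogeneity
    stat = 0.0
    for c, n in zip(counts, sizes):
        expected_term = rate * n
        expected_other = (1 - rate) * n
        stat += (c - expected_term) ** 2 / expected_term
        stat += ((n - c) - expected_other) ** 2 / expected_other
    return stat, len(counts) - 1


def gaps(tokens, term):
    """Distances, in tokens, between successive occurrences of `term`."""
    positions = [i for i, t in enumerate(tokens) if t == term]
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy example: the same term in two equally sized chunks.
even_stat, df = chi_square_homogeneity([10, 10], [100, 100])
skewed_stat, _ = chi_square_homogeneity([20, 0], [100, 100])
term_gaps = gaps(["the", "a", "the", "b", "c", "the"], "the")
```

With one degree of freedom, a statistic above roughly 3.84 would lead to rejecting homogeneity at the 5% level. In this toy data the evenly spread term stays at zero, while the skewed one exceeds that threshold.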