Sarkar, Avik; Garthwaite, Paul and De Roeck, Anne (2005).
DOI: https://doi.org/10.3115/1706543.1706552
Abstract
This paper proposes a model for term re-occurrence in a text collection based on the gaps between successive occurrences of a term. These gaps are modeled using a mixture of exponential distributions. Parameter estimation is based on a Bayesian framework that allows us to fit a flexible model. The model provides measures of a term’s re-occurrence rate and within-document burstiness. The model works for all kinds of terms, whether rare content words, medium-frequency terms, or frequent function words. A measure is proposed to account for a term’s importance based on its distribution pattern in the corpus.
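
The gap-based view described in the abstract can be illustrated with a minimal sketch. It assumes a two-component mixture with a mixing weight `p` and rates `rate1` and `rate2`; the abstract does not state the number of components or the parameterization, so these names, the helper functions, and the toy sentence are all hypothetical. The sketch only evaluates the mixture density and log-likelihood for fixed parameters and does not attempt the Bayesian estimation the paper relies on.

```python
import math


def term_gaps(tokens, term):
    """Return the gaps (in tokens) between successive occurrences of `term`."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    return [b - a for a, b in zip(positions, positions[1:])]


def mixture_pdf(x, p, rate1, rate2):
    """Density at gap length x of an assumed two-component exponential mixture."""
    return (p * rate1 * math.exp(-rate1 * x)
            + (1.0 - p) * rate2 * math.exp(-rate2 * x))


def log_likelihood(gaps, p, rate1, rate2):
    """Log-likelihood of the observed gaps under fixed mixture parameters."""
    return sum(math.log(mixture_pdf(g, p, rate1, rate2)) for g in gaps)


# Toy example (hypothetical data): "the" occurs at positions 0, 4 and 7.
tokens = "the cat sat on the mat and the cat slept".split()
gaps = term_gaps(tokens, "the")
print(gaps)  # [4, 3]
print(log_likelihood(gaps, p=0.5, rate1=1.0, rate2=0.1))
```

Under this assumed parameterization, one component with a large rate would capture short gaps (bursts of repeated use), while a component with a small rate would capture long gaps between occurrences.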